AI Intelligent Computing Platform Operations Engineer

Lenovo

Why Work at Lenovo

We are Lenovo. We do what we say. We own what we do. We WOW our customers.

Lenovo is a US$57 billion revenue global technology powerhouse, ranked #248 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world’s largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high performance computing and software defined infrastructure), software, solutions, and services. Lenovo’s continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY).

This transformation, together with Lenovo’s world-changing innovation, is building a more inclusive, trustworthy, and smarter future for everyone, everywhere. To find out more, visit www.lenovo.com and read about the latest news via our StoryHub.

Description and Requirements

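Responsibilities

  • Platform operations and stability assurance: Handle day-to-day operations, monitoring, and incident response for the AI intelligent computing platform (including GPU clusters, distributed storage, and high-speed networking), ensuring high availability (SLA ≥ 99.9%).
  • Optimize compute resource scheduling strategies to raise GPU cluster utilization and support large-model training and inference workloads.
  • Automated operations tooling: Develop operations scripts (Shell/Python) and deploy automation tools (Ansible/Kubernetes) to implement configuration management, log analysis, and CI/CD pipelines.
  • Contribute to the development of intelligent operations (AIOps) systems, integrating anomaly detection and root cause analysis capabilities.
  • Cross-team collaboration and adoption of new technologies: Work with algorithm teams to resolve compatibility issues between AI frameworks (such as TensorFlow/PyTorch) and the underlying hardware.
  • Explore innovative approaches such as RDMA networking and liquid cooling to lower the PUE (power usage effectiveness) of the intelligent computing center.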

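Requirements

  • Education: Associate degree (junior college) or above in computer science, network engineering, electronic engineering, or a related field.
  • Experience: At least 3 years of operations experience with intelligent computing centers or cloud computing platforms; familiar with AI hardware (NVIDIA GPU/Huawei Ascend) and high-speed network protocols (InfiniBand/RoCE).
  • Technical skills: Proficient in Linux system administration, containerization (Docker/Kubernetes), and monitoring tools (Prometheus/Zabbix).
  • Familiar with the low-level dependencies of AI computing frameworks (such as CUDA and NCCL), with hands-on performance tuning experience.
  • Proficient in Python/Go development and able to write automation scripts independently.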

Additional Locations:

  • China - Hubei - 武汉(Wuhan)
Confirmed 5 hours ago. Posted 30+ days ago.