Senior Software Engineer - Compute Infrastructure (Orchestration & Scheduling)

ByteDance

About the Team

The Compute Infrastructure - Orchestration & Scheduling team uses Kubernetes and Serverless technologies to build a large, reliable, and efficient compute infrastructure. This infrastructure powers hundreds of large-scale clusters globally, running millions of online containers and offline jobs daily, including AI and LLM workloads. The team is dedicated to building cutting-edge, industry-leading infrastructure that empowers AI innovation, ensuring the high performance, scalability, and reliability needed to support the most demanding AI/LLM workloads. The team also open-sources key infrastructure technologies, including projects in the K8s portfolio such as KubeWharf, Serverless initiatives like Ray on K8s, and the LLM inference control plane project AIBrix.

At ByteDance, as we expand and innovate, powering global platforms like TikTok and various AI/ML & LLM initiatives, we face the challenge of enhancing resource cost efficiency at massive scale within our rapidly growing compute infrastructure. We're seeking talented software engineers excited to optimize our infrastructure for AI & LLM models. Your expertise can drive solutions to better utilize computing resources (including CPU, GPU, power, etc.), directly impacting the performance of all our AI services and helping us build the future of computing infrastructure. As we grow our compute infrastructure in overseas regions, including North America, Europe, and Asia Pacific, you will also have the opportunity to work closely with leaders from ByteDance’s global business units to ensure that we continue to scale and optimize our infrastructure globally.

Responsibilities

  • Engineer hyper-scale cluster management: Enhance Kubernetes-based cluster platforms to deliver exceptional performance, scalability, and resilience—powering resource management across ByteDance’s massive global infrastructure.
  • Innovate on core scheduling capabilities: Design and maintain a truly unified scheduler that powers diverse workloads (containers & VMs, online services, offline computing, AI/ML, CPU/GPU workloads, etc.) in a massive-scale resource pool.
  • Develop an intelligent scheduling system: Leverage AI models to optimize workload performance and resource utilization across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers.
  • Lead Infrastructure for Next-Gen ML Workloads: Design and drive the evolution of compute platforms purpose-built for fast, reliable, and cost-effective ML and LLM training/inference.
  • Deliver Quality and Innovation: Write high-quality, maintainable code, and stay at the forefront of open-source and research advancements in AI, ML, systems, and Serverless technologies.

Qualifications

Minimum Qualifications

  • B.S./M.S. degree in Computer Science, Computer Engineering, or a related area with 3+ years of relevant industry experience; new graduates with a Ph.D. degree and a strong publication record may be considered as an exception.
  • Solid understanding of at least one of the following fields: Unix/Linux environments, distributed and parallel systems, high-performance networking systems, or large-scale software system development.
  • Proven experience designing, architecting, and building cloud and ML infrastructure, including but not limited to resource management, allocation, job scheduling, and monitoring.
  • Familiarity with container and orchestration technologies such as Docker and Kubernetes.
  • Proficiency in at least one major programming language such as Python, Go, C++, Rust, or Java.

Preferred Qualifications

  • Experience with one or more large-scale cluster management systems, e.g., Kubernetes, Ray, YARN, or Mesos
  • Experience in large-scale resource efficiency management and job scheduling development
  • Project experience in application scaling, workload co-location, and isolation enhancement
  • Experience with a public cloud provider (AWS, Azure, or GCP) and its ML services (e.g., AWS SageMaker, Azure ML, GCP Vertex AI).
  • Great communication skills and the ability to work well within a team and across engineering teams.
  • Passionate about system efficiency, quality, performance, and scalability.