Research Engineer Intern (Doubao (Seed) - Machine Learning System) - 2025 Summer (MS)

ByteDance

Team Intro:

The Seed Infrastructures team oversees distributed training, reinforcement learning frameworks, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

We are looking for talented individuals to join us for an internship in 2025. Turn your ambitions into reality as your inspiration brings infinite opportunities at ByteDance.

Internships at ByteDance aim to provide students with industry exposure and hands-on experience in developing fundamental skills and exploring potential career paths. A vibrant blend of social events and enriching development workshops will be available for you to explore. Here, you will apply your knowledge in real-world scenarios while laying a strong foundation for personal and professional growth. This Internship Program runs for 12 weeks beginning in May/June 2025.

Candidates may apply to a maximum of two positions and will be considered in the order they apply. The application limit applies to ByteDance and its affiliates' jobs globally. Applications are reviewed on a rolling basis, so we encourage you to apply early.

Responsibilities

  • Research and develop efficient machine learning systems, including efficient optimizers and parameter- and gradient-efficient training with rank reduction and communication compression (a minimal sketch follows this list).
  • Develop a state-of-the-art asynchronous training framework that preserves convergence.
  • Implement both general-purpose training framework features and model-specific optimizations (e.g., LLMs, diffusion models).
  • Improve efficiency and stability for extremely large-scale distributed training jobs.
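
To make the first bullet concrete, here is a minimal PowerSGD-style sketch of gradient communication compression via low-rank factorization, written against PyTorch's public API; the shapes, rank, and helper names are illustrative assumptions, not the team's actual code:

    import torch

    def compress(grad: torch.Tensor, q: torch.Tensor):
        # One power-iteration step: project the 2-D gradient onto a rank-r basis.
        p = grad @ q                   # (m, r) left factor
        p, _ = torch.linalg.qr(p)      # orthonormalize for numerical stability
        q_new = grad.t() @ p           # (n, r) right factor
        return p, q_new                # communicate p and q_new, not the full grad

    def decompress(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # Rebuild a rank-r approximation of the original gradient.
        return p @ q.t()

    m, n, r = 1024, 512, 4
    grad = torch.randn(m, n)           # stand-in for a weight-matrix gradient
    q = torch.randn(n, r)              # in practice warm-started across steps
    p, q = compress(grad, q)
    approx = decompress(p, q)
    # Communication volume drops from m*n to r*(m+n) values per step.

PyTorch ships a production variant of this idea as a DDP communication hook (powerSGD_hook); the sketch above shows only the core factorization.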

Qualifications

Minimum Qualifications

  • Currently pursuing an MS in Software Development, Computer Science, Computer Engineering, or a related technical discipline.
  • Familiar with machine learning algorithms and platforms.
  • Familiar with C/C++ and Python development in Linux environments.
  • Familiar with at least one deep learning framework (TensorFlow, PyTorch, MXNet, or others).
  • Ability to work independently and complete projects end to end in a timely manner.
  • Good communication and teamwork skills to clearly convey technical concepts to teammates.
  • Must obtain work authorization in the country of employment at the time of hire and maintain ongoing work authorization during employment.

Preferred Qualifications

  • Experience with GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs); a minimal collective example follows this list.
  • Experience with distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, and GSPMD.
  • Familiarity with AI compiler stacks such as torch.fx, XLA, and MLIR.
  • Experience in large-scale data processing and parallel computing.
  • Experience in designing and operating large-scale systems in cloud computing or machine learning.
  • Experience with in-depth CUDA programming and performance tuning (CUTLASS, Triton).
  • Published papers at top conferences.
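
As a hedged illustration of the first preferred qualification, the sketch below runs an NCCL-backed all-reduce through torch.distributed; it assumes a single multi-GPU node launched with torchrun, and the tensor contents are arbitrary:

    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR; NCCL runs the GPU collective.
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank % torch.cuda.device_count())
        x = torch.full((4,), float(rank + 1), device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)   # every rank ends with the sum
        print(f"rank {rank}: {x.tolist()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node=2 allreduce_demo.py, every rank prints the same summed tensor; gradient averaging in data-parallel training reduces to exactly this collective.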