Machine Learning Engineer - Machine Learning Infrastructure

ByteDance

Responsibilities

The mission of our AML team is to advance the next-generation AI infrastructure and recommendation platform for ads ranking, search ranking, and live and e-commerce ranking across the company. We also drive substantial impact on the company's core businesses. We are currently looking for a Machine Learning Engineer - Machine Learning Infrastructure to join our team and help support and advance that mission.

Responsibilities:

  • Responsible for the design and implementation of a global-scale machine learning system for feeds, ads and search ranking models.
  • Responsible for improving the usability and flexibility of the machine learning infrastructure.
  • Responsible for improving the workflow of model training and serving, data pipelines, storage systems and resource management for multi-tenant machine learning systems.
  • Responsible for designing and developing key components of ML infrastructure and mentoring interns.

Job requirements

  • Proficient in at least one programming language, such as Go or Python, in a Linux environment, with excellent coding skills.
  • Familiar with open-source distributed scheduling/orchestration/storage frameworks, such as Kubernetes (K8S), YARN (Flink, MapReduce), Mesos, Celery, HDFS, Redis, S3, etc., with rich practical experience in machine learning system development.
  • Master the principles of distributed systems, and have participated in the design, development and maintenance of large-scale distributed systems.
  • Possess excellent logical analysis skills, and be able to abstract and decompose business logic sensibly.
  • Have a strong sense of responsibility, good learning ability, communication ability and self-motivation, and be able to respond and act quickly.
  • Have good documentation habits, and write and update workflow and technical documents in a timely manner as required.

Qualifications

Minimum Qualifications:

  • Experience contributing to an open-source machine learning framework (TensorFlow/PyTorch).
  • Experience with big data frameworks (e.g., Spark/Hadoop/Flink), and with resource management and task scheduling for large-scale distributed systems.
  • Experience in using/designing open-source machine learning lifecycle management systems: TFX
Confirmed 12 hours ago. Posted 30+ days ago.
