Senior Site Reliability Engineer, Product - USDS

TikTok


Responsibilities

About the team

The Product Engineering team monitors and maintains the availability of TikTok, including services such as video playback, content discovery/recommendations, live streaming, and customer service feedback.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities

We are seeking a highly motivated and experienced Site Reliability Engineer to join our growing team. You will be responsible for the reliability, performance, and scalability of our production systems, and you will play a critical role in ensuring that those systems are designed and operated with resiliency and high availability in mind.

In this role, you will:
- Collaborate with cross-functional teams to design, deploy, and operate large-scale, high-availability systems
- Develop and maintain automation tools and processes to improve the reliability and efficiency of our systems
- Act as a technical lead for SRE-related initiatives, providing guidance and mentorship to team members
- Work closely with software engineers to diagnose and resolve production issues
- Continuously monitor and evaluate the health of our systems, proactively identifying and addressing potential issues before they become problems
- Participate in an on-call rotation to provide 24/7 support for production systems
- Drive innovation and improvement in our infrastructure and processes through experimentation and research
- Participate in the design and implementation of disaster recovery plans

Qualifications

Minimum Qualifications

1. Bachelor's degree or higher in Computer Science or a related technical discipline
2. 5+ years of experience in Site Reliability Engineering, Production Engineering, or a similar role, working with large-scale distributed systems
3. Strong understanding of containers and container orchestration tools such as Docker and Kubernetes
4. In-depth knowledge of Unix/Linux systems administration, network fundamentals, and storage systems
5. Proficiency in one or more programming languages, such as C, C++, Java, Python, Go, Ruby, Rust, or JavaScript
6. Strong analytical and problem-solving skills

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.

Confirmed 22 hours ago. Posted 30+ days ago.
