Senior Site Reliability Engineer, Product - USDS

TikTok

About the team

The Product Engineering team monitors and maintains the availability of TikTok, including services such as video playback, content discovery/recommendations, live streaming, and customer service feedback.

In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities

We are seeking a highly motivated and experienced Site Reliability Engineer to join our growing team. You will be responsible for ensuring the reliability, performance, and scalability of our production systems. You will play a critical role in ensuring our systems are designed and operated with resiliency and high availability in mind.

In this role, you will:
- Collaborate with cross-functional teams to design, deploy, and operate large-scale, high-availability systems
- Develop and maintain automation tools and processes to improve the reliability and efficiency of our systems
- Act as a technical lead for SRE-related initiatives, providing guidance and mentorship to team members
- Work closely with software engineers to diagnose and resolve production issues
- Continuously monitor and evaluate the health of our systems, proactively identifying and addressing potential issues before they become problems
- Participate in an on-call rotation to provide 24/7 support for production systems
- Drive innovation and improvement in our infrastructure and processes through experimentation and research
- Participate in the design and implementation of disaster recovery plans

Qualifications

Minimum Qualifications

1. Bachelor's degree or higher in Computer Science or a related technical discipline
2. 5+ years of experience in Site Reliability Engineering, Production Engineering, or a similar role, working with large-scale distributed systems
3. Strong understanding of containers and container orchestration tools such as Docker and Kubernetes
4. In-depth knowledge of Unix/Linux systems administration, network fundamentals, and storage systems
5. Proficiency in one or more programming languages, such as C, C++, Java, Python, Go, Ruby, Rust, or JavaScript
6. Strong analytical and problem-solving skills

Candidates for this position must be legally authorized to work in the United States. This position is not eligible for visa sponsorship or support.

Confirmed 2 hours ago. Posted 30+ days ago.