The USDS TikTok Product Engineering SRE team works with engineering and product teams to build, maintain and run large-scale, globally distributed, observable, fault-tolerant systems. SREs on this team will deliver on production ownership and be responsible for observability and automation across complex, large-scale service mesh architectures.

In order to enhance collaboration and cross-functional partnerships, among other things, at this time, our organization follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities:
- Provide technical leadership and mentorship to a team of Site Reliability Engineers focused on building observable, fault-tolerant systems
- Drive architectural decisions for large-scale, globally distributed service mesh architectures
- Establish and maintain production ownership models, incident response protocols, and service level objectives
- Develop strategic roadmaps for observability and automation initiatives that enhance system reliability
- Balance technical contributions with people management responsibilities, including career development, performance evaluations, and team growth
- Foster a culture of reliability, continuous improvement, and knowledge sharing within your team and across the organization
- Lead security initiatives to safeguard critical assets, partnering with security and compliance teams to implement robust protocols that ensure data protection and regulatory compliance across all services

Minimum Qualifications:
- 5+ years of experience and expertise in designing, analyzing, and troubleshooting large-scale distributed systems, relational databases, caching solutions and web service frameworks
- Previous experience leading a small to mid-size team while maintaining significant "hands-on" technical contributions
- Strong understanding of Unix/Linux operating systems internals and networking fundamentals
- Proficiency in writing production-grade code in Go, Python, Java or similar languages
- Proven track record of establishing and implementing SRE best practices across engineering organizations
- Experience developing and maintaining service level objectives (SLOs) and error budgets

Preferred Qualifications:
- Deep expertise in algorithms, data structures, and systems design with proven ability to architect complex technical solutions
- Track record of developing sophisticated automation tools and developer-friendly APIs that streamline operations and eliminate toil
- Exceptional analytical mindset with demonstrated success solving intricate technical problems across distributed systems
- Extensive experience running high-availability web services at massive scale, with comprehensive knowledge of cloud-native architectures and advanced networking concepts
- Proven ability to lead and collaborate effectively with globally distributed engineering teams across multiple time zones and cultural contexts
- Strategic vision to balance immediate operational needs with long-term reliability and scalability objectives
- Success in designing and implementing observability solutions for complex distributed systems