About the team
The USDS TikTok Search Infra SRE team works with engineering and product teams to build and run large-scale, globally distributed, observable, fault-tolerant systems. SREs on this team take production ownership and are responsible for observability and automation across complex, large-scale service mesh architectures.

Responsibilities:
• Engage in and improve the whole lifecycle of Search systems, from system design consulting through launch reviews, deployment, operation, and refinement
• Deliver tools and software that improve the reliability and scalability of services, automate operations, and improve R&D efficiency
• Build availability for large-scale services deployed across global data centers
• Plan, manage, and optimize cloud resource utilization, ensuring the SLAs of large-scale clusters are met
• Measure and monitor availability, latency, and overall service health
• Practice sustainable incident response and postmortems

Minimum Qualifications
• Bachelor's degree or above in Computer Science or a related field, with at least 2 years of related work experience
• SRE experience deploying large-scale systems with high reliability and scalability
• Familiarity with Linux system operations and networking
• Experience programming in at least one of the following languages: Python, Perl, Go, or C/C++
• Experience designing, analyzing, and troubleshooting large-scale distributed systems
• Familiarity with popular CI/CD procedures and environments
• Effective communication skills and a sense of ownership and drive