Lead Site Reliability Engineer
Reporting to: Sr Manager, Availability Management
Office Location: Chennai, India
Flexible Working: Hybrid (Part Office/Part Home)
Cloud Site Reliability Engineer Responsibilities
- Onboard internal customers to our 24x7 Applications Support and Enterprise Status Page services.
- Help build a global SRE culture by defining monitoring strategies and best practices across the organization.
- Monitor application performance and recommend ways to increase the observability of applications and platforms.
- Play an important role in the Continual Service Improvement process, identifying and driving improvements.
- Be instrumental in developing standards and guides that help the business maximize its use of common tools.
- Participate in code peer reviews and enforce quality gates to ensure best practices are followed.
- Automate tasks that would benefit from it. Automating repetitive work and deploying monitors via code are core examples.
- Document knowledge gained from engagements in the form of runbooks and other information critical to incident response.
- Explore and apply Artificial Intelligence to enhance operational processes and procedures.
Should-Haves - Skills & Experience
- Strong skills with modern monitoring tools and demonstrable knowledge of APM, RUM and/or synthetic testing.
- Experience working with observability tools such as Datadog, New Relic, Splunk, CloudWatch, and Azure Monitor.
- Experience with the OpenTelemetry (OTel) standard.
- Working knowledge of at least one programming language, such as Python, JavaScript (Node.js, etc.), Go, or others.
- Strong experience with IaC tools, such as Terraform and CloudFormation.
- Experience with cloud environments, especially AWS and/or Azure.
- Good customer interaction skills and the ability to understand customers' needs and expectations.
- Strength of conviction: able to encourage adoption across a wide audience, but comfortable mandating where necessary.
- Experience with code quality tools, such as SonarQube.
- Knowledge of code linting tools for various programming languages.
- Experience with CI/CD tools, such as Bamboo, Jenkins, Azure DevOps, and GitHub Actions.
- ITIL experience with a basic understanding of incident management, problem management, and change management.
Nice-to-Haves - Skills & Experience
- Any cloud certification
- ITIL certifications
- Experience with ITSM tools
- Experience using on-call management tooling
No travel required