Company Description
Visa is a world leader in payments and technology, with over 259 billion payment transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive. We are driven by a common purpose: to uplift everyone, everywhere by being the best way to pay and be paid.
Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.
Job Description
The Opportunity
We are looking for a talented, curious, and energetic Sr. Site Reliability Engineer who embraces solving complex challenges on a global scale. As a Visa Sr. SRE, you will be responsible for supporting various digital projects and ensuring the reliability and stability of our production transaction processing systems. These systems are the backbone of our business and process millions of transactions daily.
The Work Itself
The Product Reliability Engineering (PRE) group prides itself on keeping Visa's applications and systems up and running to meet the 24/7 needs of the business.
- Provide support for critical applications, ensuring their stability and reliability by performing proactive maintenance activities and responding to alerts.
- Engage in automation activities to improve operational efficiency and reduce manual intervention.
- Support applications and infrastructure built on modern technologies such as Kubernetes and containers, Kafka, Grafana, Prometheus, and Elasticsearch.
- Perform root cause analysis and remediation for incidents impacting application stability and performance.
- Monitor application performance (e.g., memory usage, logging, latency) and take corrective actions as needed.
- Write and maintain scripts for monitoring system activity, including running application smoke tests before and after production implementations.
- Support application deployments and code releases in test and production environments using industry-standard deployment tools (e.g., Chef, Jenkins).
- Respond to and resolve client-escalated issues related to applications (e.g., increased latency, transactional issues, features not working as expected).
- Implement and maintain performance monitoring dashboards using industry-standard tools (e.g., Splunk, ThousandEyes, Keynote, Runscope, Ghost Inspector, Evolven, Graphite).
- Participate in on-call rotation to provide 24/7 support for production environments.
- Document incident resolutions, troubleshooting steps, and best practices to improve team knowledge and onboarding.
- Collaborate with development, infrastructure, and product teams to resolve complex issues.
- Support disaster recovery and business continuity exercises as required.
- Assist in managing and executing change, incident, and problem management processes.
- Provide regular status updates and incident reports to management and stakeholders.
- Maintain and update runbooks and standard operating procedures for application support.
- Participate in knowledge transfer sessions and contribute to team training initiatives.
- Assist with user access management, including provisioning and de-provisioning access as per company policies.
- Support certificate renewals, patch management, and vulnerability remediation activities as required.
Essential Functions
The Skills You Bring
- Collaboration and Teamwork: The candidate should have strong interpersonal skills and be adept at both written and verbal communication. The candidate should have a strong inclination toward teamwork and be able to collaborate effectively with a globally dispersed virtual team.
- Learning Capacity: A fast learner who readily picks up new technologies and tools and can share this knowledge with others.
- Adaptability and Innovation: Comfortable pushing boundaries and exploring beyond traditional solutions. Embraces challenges, new technologies, and innovation.
- Decision-making: The candidate should be able to prioritize, multitask, and deliver quality work on time. The candidate can plan effectively, make informed decisions about execution timelines, and maintain focus in stressful situations.
- Personal and Professional Growth: The candidate is highly self-motivated and has a strong sense of ownership. The candidate should have a keen interest in learning new technologies and business concepts to support personal and organizational growth.
- Professional Ethics: The candidate must have strong business ethics, self-discipline, and trustworthiness, particularly when handling highly sensitive and confidential data in a live production environment.
This is a hybrid position. The expected number of days in the office will be confirmed by your hiring manager.
Qualifications
Basic Qualifications
- 2+ years of relevant work experience and a Bachelor’s degree, OR 5+ years of relevant work experience
Preferred Qualifications
- Bachelor’s or Master’s degree in Computer Science or a related field from an accredited university.
Experience
- 3–5 years of hands-on experience in a Site Reliability Engineering (SRE) or DevOps role.
- Proven experience supporting applications on public cloud platforms (AWS, GCP, Azure) and working with hybrid cloud/on-premises models.
- Strong experience with the full software development lifecycle (SDLC) and Agile methodologies.
- Experience participating in release management and on-call support for both cloud and on-premises technologies.
Technical Skills
- Proficiency with Linux/UNIX operating systems.
- Strong hands-on expertise with containers and orchestration tools, especially Kubernetes and Docker.
- Working knowledge of database technologies such as MySQL, MSSQL, or Oracle.
- Advanced scripting skills in UNIX Shell, PowerShell, Python, or similar languages.
- Proficiency in troubleshooting applications across middleware stacks (Tomcat, Apache, Kafka, MQ) and stream processing frameworks (Flink, Spark).
- Experience implementing and managing DevOps pipelines using Jenkins, Ansible, Docker, and Kubernetes.
- Familiarity with application development and troubleshooting in Go and Rust, and with resolving system integration issues.
- Practical experience designing and implementing CI/CD processes for seamless deployments.
- Solid understanding of monitoring and observability tools such as Prometheus, Splunk, and Grafana.
- Hands-on experience creating deployments, services, and ingress flows for applications in Kubernetes clusters.
Soft Skills
- Excellent problem-solving and troubleshooting abilities with a strong attention to detail.
- Effective prioritization, coordination, and multitasking skills in a fast-paced environment.
- Strong collaboration skills and the ability to work as part of a cross-functional team.
- Excellent written and verbal communication skills.
- Fast learner with the ability to quickly adopt new technologies and industry trends.
- Adherence to the company’s Work From Office policy.
Additional Information
Visa is an EEO Employer. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability or protected veteran status. Visa will also consider for employment qualified applicants with criminal histories in a manner consistent with EEOC guidelines and applicable local law.