
NVIDIA's NGC (NVIDIA GPU Cloud) team is looking for highly motivated Linux System Administrator/DevOps engineers to develop, implement, and operate a global, dynamic, state-of-the-art Service Reliability Operations Center that provides extraordinary levels of support for our Compute Infrastructure and services. As a key member of the Compute Infrastructure Support (CIS) team, you will partner with other key members of our organization, including Site Reliability Engineering, the Security Operations Center, DevOps teams, and other datacenter operations partners, to help our services approach 100% availability. On the rare occasion that an incident occurs, you will be our front line, working to decrease the frequency and duration of any issue. Working in partnership with the development community, the team will develop monitors, alarms, and alerts to make the service more reliable and improve our customer experience. Additionally, you will be closely involved in supporting the technologies that the CIS team uses to monitor, run, and measure the effectiveness of the Compute environment.

What you will be doing:

The team will provide its services 24/7 in a follow-the-sun model that spans continents.
You will directly report to a manager in Bangalore.
Each team member will need to work either a Saturday or a Sunday each week. The shift covers India daytime hours so that the US and India teams together provide 24/7 coverage.
The heart of the Compute Infrastructure Support team's work will be monitoring and triaging a growing on-prem and CSP (Cloud Service Provider) production compute and storage server fleet in a large datacenter environment.
Every team member will utilize alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and execute predictive support or diagnostic routines.
Perform Linux administration, network administration, and security incident monitoring tasks to drive your actions.
CIS team members will work with the automation team to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added to automation, you will also update and evolve the runbooks as needed.
Use strong communication and interpersonal skills to keep the team engaged through incident resolution, including initiating the incident management procedure.

What we need to see:

BS/BE degree in Computer Science, Electronics or equivalent experience.
Minimum of 3 years’ experience administering open system servers in demanding Internet, Cloud, or Telecommunications production environments in a Linux Systems Administration, DevOps, or SRE role.
Strong problem-solving, analytical, and troubleshooting abilities on Linux Clusters on public or private clouds.
Strong Linux administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables. RHCE or equivalent level of knowledge.
Experience in remote OS installation on bare-metal nodes, troubleshooting boot issues, and identifying hardware faults.
Experience scripting in Python and writing Ansible playbooks is preferred but not required.
Knowledge and understanding of application containers, container orchestration systems, and Git workflows.
Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
Demonstrated ability to master and maintain complicated environments.

NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most forward-thinking and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.


Confirmed 16 hours ago. Posted 15 days ago.


Service Reliability Operations Engineer

NVIDIA
