Site Reliability Engineer

Copart

Copart is seeking a Site Reliability Engineer for our Dallas HQ office, specializing in systems and application monitoring and troubleshooting. This position will be part of a 24/7 Global Network Operations team that monitors Copart's Global Data Center and Application infrastructure and provides L1/L2 support to meet its SLA commitments.

Ideal Candidate:

Team Player -- A candidate who works well in a collaborative team environment. Effective communication skills and a great personality are a must.

Talented -- Your skill set extends beyond core knowledge of Windows, UNIX, or Linux platforms. In addition to knowledge of core systems, experience with VM environments, networking, scripting, automation, and Kubernetes, along with awareness of other technologies, is a plus.

Innovative -- We are always looking for ways to improve our processes and procedures. The ideal candidate should have a natural desire to make things better and not be afraid to speak up when an opportunity for improvement arises.

Essential Duties and Responsibilities:

  • Perform application deployments using Jenkins and Spinnaker on Prod and Non-Prod Environments.
  • Coordinate and perform periodic failover testing of Copart's Network/Systems Infrastructure and application environments.
  • Build/Optimize tools with Python, Ansible and Grafana to monitor/collect key metrics and automate remediation of Infrastructure or application issues.
  • Perform monthly security patching of Systems OS and applications.
  • Maintain and optimize the following tools and repositories: Nagios, Netbox, Prometheus, Grafana, Sumologic, Selenium, Instana, Github, and more.
  • Interface with internal teams (Product Development, DevOps, Network, Systems, and DB).
  • Use internal monitoring tools to analyze and proactively monitor Copart's Global Data Center and Application infrastructure, catching and quickly resolving issues before they escalate.
  • Communicate issues quickly and efficiently across several of Copart's domains.
  • Develop analysis and reporting capabilities; monitor performance and quality control plans to identify improvements.
  • Document standard operating procedures, diagrams, and training materials for use by the teams.

Requirements:

  • Progressive knowledge of monitoring protocols such as SNMP, NetFlow, and Syslog.
  • Intermediate programming and scripting knowledge
  • Knowledge of different monitoring methodologies, e.g. agent-based and agentless checks.
  • Troubleshooting knowledge of Linux/Unix/Windows-based systems
  • Experience working with VM management software (vSphere)
  • Knowledge of monitoring tools such as Nagios, SolarWinds, and Site24x7
  • Flexibility and the ability to handle competing or changing priorities.
  • Very strong oral and written communication skills
  • Must be a self-starter with the ability to work well in a team environment
  • Flexible schedule required
  • Knowledge of the following areas is a BIG plus:
  • Dashboard applications such as Grafana
  • Scripting/Programming/Automation -- Python, Bash, Ansible, Stackstorm
  • Experience working with Github, Jenkins, Spinnaker, Docker, Kubernetes
  • Web development languages, libraries, and frameworks such as Java, JavaScript, AngularJS, and Flask.


Confirmed 7 hours ago. Posted 30+ days ago.