Site Reliability Engineer - Cloud (AWS / GCP) / Python / Kafka

Juniper


Juniper's AI-Driven Enterprise (AIDE) is seeking a full-time SRE to join our talented team and build high-quality technology solutions that revolutionize wireless networks, powered by artificial intelligence in the cloud. Mist provides services through SaaS applications to several Fortune 100 and Fortune 500 customers. You'll take ops projects from concept through to launch. You will be responsible for maintaining and improving the company's production environment for rapid scaling and outstanding performance, and for helping us maintain stellar uptime and reliability. The improvements you implement will be felt by the entire organization.

As a Site Reliability Engineer (SRE) at Juniper Networks, you will be responsible for keeping our cloud-based services, streaming frameworks, NoSQL/RDBMS databases, and distributed analytical platforms running in multi-cloud environments to deliver unprecedented IT automation and insight into user experiences, driven by our AI services, across our customers’ geographically distributed networks.

Responsibilities:

  • Build infrastructure as code using Terraform, Ansible, and Kubernetes.
  • Manage and performance-tune either databases (Postgres, Redis, Cassandra, Elasticsearch) or streaming data pipelines (Kafka, Flink, Storm, Spark).
  • Manage CI/CD pipelines, configuration, and automation tools for infrastructure provisioning.
  • Write and maintain runbooks for knowledge-driven automated processes and bots.
  • Do capacity planning based on performance, usage, and utilization stats.
  • Partner with developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems.
  • Ensure system availability and business continuity by implementing redundant servers/services.
  • Manage after-hours infrastructure updates and maintenance.
  • Proactively research and propose the use of new concepts, processes, technologies, and tools.
  • Proactively monitor, diagnose, and resolve issues in a 24x7 multi-cloud environment (AWS/GCP), including participation in an on-call rotation; analyze failures and support software engineers in debugging production issues across microservices and distributed platforms.
  • Follow SRE best practices and procedures.

Experience required for you to be successful:

  • An extensive background in developing and operating large-scale cloud-based distributed applications
  • Direct experience developing/running applications on AWS and Google Cloud.
  • Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, software maintainability, and operational excellence
  • The ability to "fix the plane while in flight" (not just support greenfield solutions)
  • The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off

Required skills:

  • Bachelor’s degree in Computer Science or Computer Engineering, or equivalent
  • Basic programming skills in Python, Java, or Golang. 
  • Understanding of distributed systems. 
  • Understanding of data management technologies including relational and non-relational databases. 
Confirmed 17 hours ago. Posted 30+ days ago.
