Keystone is a premier strategy, technology, and economics firm that delivers science-led AI solutions for Fortune 500 companies. We design, deploy, and operate machine learning software that automates and optimizes complex operational and commercial decisions.

Our CoreAI Solutions Group includes world-class AI and ML practitioners with unmatched experience implementing large-scale, high-impact models that help enterprises make smarter decisions across manufacturing, supply chain, sales, and marketing. We bring transformative ideas to life—and ensure they scale, endure, and deliver measurable value.

We are based in Bellevue, WA; New York; Boston; San Francisco; and London.

Position Overview

Keystone is seeking a Principal Customer Reliability Engineer to take end-to-end ownership of how CoreAI software performs in the real world across SaaS, managed services, and customer-deployed environments. This role is central to our mission of using science-led AI to drive measurable impact for the world’s most complex organizations.

Reporting directly to the Vice President of Engineering, this is a hands-on, high-leverage role where you’ll design, build, and own the systems that ensure our products are resilient, observable, and delivering value at scale. As our first hire in this function, you won’t just shape the strategy—you’ll execute it. You’ll move with urgency, build with intent, and bring engineering clarity to how our systems behave in production.

You’ll be responsible for developing robust telemetry, monitoring, and alerting pipelines that provide real-time visibility into system health and long-term insights into model performance, stability, and drift. These systems will be critical to maintaining trust, reliability, and fairness in high-impact production environments.

Our ideal candidate thrives at the intersection of software engineering and customer empathy, brings a scientific mindset to operational excellence, and leverages every available tool—from AI-native development environments to open source observability stacks—to deliver durable, high-performance solutions.

You’ll work closely with a cross-functional team of scientists, engineers, economists, and strategists to establish the foundation of a scalable, ethical, and production-grade AI platform built to adapt to real-world complexity.

What You Will Do

Lead reliability, observability, and operational feedback systems for CoreAI software across SaaS, managed services, and customer-hosted environments.
Build from scratch the infrastructure that tracks uptime, latency, usage, errors, and model behavior, ensuring full visibility into live deployments.
Design telemetry, monitoring, and alerting pipelines to surface real-time system health and long-term trends in model stability, fairness, and drift.
Establish and operate incident management and correction-of-error (CoE) processes that promote transparency, learning, and continuous improvement.
Create dashboards that turn raw telemetry into actionable insights for engineering, science, and executive stakeholders.
Collaborate with platform, science, and deployment teams to embed reliability and observability standards into the development lifecycle.
Leverage LLM-based development tools and automation to reduce friction, accelerate delivery, and extend your own impact.
Codify operational playbooks and deployment patterns into repeatable practices that scale across customers.
Act as a cross-functional leader, bringing engineering truth to customer engagements, product planning, and executive decision-making.

The Ideal Candidate

Combines strategic leadership with tactical execution—equally comfortable setting direction and writing code.
Brings a builder’s mindset, and thrives in a zero-to-one environment with urgency, precision, and ownership.
Prioritizes customer outcomes and understands how system reliability translates to business impact and trust.
Leverages AI-native tools and automation to build smarter and faster.
Upholds a high bar for engineering rigor, ethical responsibility, and long-term maintainability in production systems.
Collaborates across disciplines and adapts quickly to change in a fast-moving, high-accountability environment.

Minimum Qualifications

15+ years of experience in software engineering, SRE, DevOps, or platform operations
Proven ownership of production systems in mission-critical, customer-facing environments
Strong proficiency in Python, Go, or equivalent languages
Deep experience with observability tools (Datadog, Prometheus, Grafana, OpenTelemetry)
Familiarity with AWS and cloud-native services; experience with hybrid or customer-hosted environments preferred
Demonstrated success in leading incident response, root cause analysis, and CoE workflows
Experience in early-stage or high-velocity environments

Preferred Qualifications

Experience with model operations (MLOps) or ML observability systems
Exposure to fairness monitoring or ethical considerations in live AI systems
Familiarity with reinforcement learning, probabilistic models, or stochastic performance monitoring
Hands-on use of AI-native development tools (e.g., GitHub Copilot, Codeium, etc.)

US Salary Range: $235,000 - $280,000, plus an annual discretionary bonus, 401k contribution, and competitive benefits package. Actual compensation within the range will depend on the level the individual is hired into, based on their skills, experience, and qualifications.

At Keystone we believe diversity matters. At every level of our firm, we seek to advance and promote diversity, foster an inclusive culture, and ensure our colleagues have a deep sense of respect and belonging. If you are interested in growing your career with colleagues from varied backgrounds and cultures, consider Keystone Strategy.


Confirmed 11 hours ago. Posted 5 days ago.