Overview
The primary objective of this role is to establish and lead a centralized operations function for PepsiCo's Gen-AI platforms, including PepGenX and Agent Factory. The role ensures platform reliability, incident management, vendor performance, and operational excellence across all sectors. It is critical to sustaining enterprise-scale AI adoption: the role oversees 24x7 monitoring, ServiceNow-based workflow automation, and cloud FinOps, and coordinates with engineering, product, and vendor teams to maintain high availability, security, and performance at scale.
Responsibilities
- Deliver 24x7 uptime and incident management via a dedicated Gen-AI Command Centre.
- Provide scalable platform monitoring and alerting across PepGenX, Agent Factory, and LLM stacks.
- Integrate with ServiceNow for workflows, change control, and audit trails.
- Enforce vendor accountability and govern contractor performance.
- Drive cost efficiency and platform FinOps using Azure-native tools.
- Uphold enterprise security, compliance, and operational SLAs as Gen-AI becomes business-critical.
Qualifications
- 12+ years of experience in IT operations, with a focus on AI/ML systems and cloud infrastructure.
- Proven expertise in setting up and managing command centers or NOCs for large-scale platforms.
- Deep knowledge of Azure cloud services, including monitoring, security, and FinOps tools.
- Extensive experience with ServiceNow, including incident management, change control, and automation workflows.
- Strong understanding of ML and data science operations, including model deployment and monitoring.
- Demonstrated ability to manage contractors and third-party vendors effectively.
- Excellent problem-solving skills and the ability to make decisions under pressure.
- Strong leadership and communication skills, with experience managing cross-functional teams.