AI Operations

Job in 411001, Pune, Maharashtra, India
Listing for: Allianz Commercial
Full Time position
Listed on 2026-02-14
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
This job is with Allianz Commercial, an inclusive employer and a member of myGwork, the largest global platform for the LGBTQ+ business community. Please do not contact the recruiter directly.

Job Title
AI Automation - Operations Engineer
Overview
We are hiring an AI Automation Operations Engineer to own operational excellence for our AI & Automation products and AIOps platforms. This role spans end-to-end reliability across infrastructure, application, middleware, and AI/GenAI layers. You will design monitoring and health checks, lead platform upgrades and high‑availability setups, drive stability and incident management, enable product adoption, document production processes, and contribute to pre‑prod testing and release readiness.
Core Responsibilities
Monitoring and Observability
Design and implement comprehensive monitoring, alerting, and health‑check frameworks across infra, app, middleware, and AI/GenAI layers.
Build dashboards and SLO/SLA telemetry using Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, or equivalent.
Define key metrics (availability, latency, error rates, model drift, pipeline throughput) and set automated alerts and escalation paths.
Automate health checks and synthetic transactions for critical user journeys and model inference paths.
Upgrades, High Availability, and Roadmap
Lead platform and product upgrades, including Active‑Active and Active‑Passive setups and blue/green and canary deployment strategies.
Plan and own upgrade roadmaps in collaboration with Ops, GCC, Engineering, Product, and stakeholders; coordinate maintenance windows and rollback plans.
Validate upgrades in pre‑prod and staging, ensure zero‑ or low‑downtime cutovers, and document upgrade runbooks.
Stability, Incident and Problem Management
Own the incident lifecycle from detection to resolution and RCA; run incident response and post‑mortems.
Drive reliability engineering practices: capacity planning, performance tuning, chaos testing, and resilience patterns.
Implement automation for remediation, runbook execution, and incident mitigation to reduce MTTR.
Maintain SLAs and report availability and reliability metrics to stakeholders.
Enablement and Adoption
Deliver enablement sessions, workshops, and demos to internal teams and customers on how to use AI Automation products.
Create and maintain user manuals, quick start guides, runbooks, and FAQs tailored to operators, developers, and business users.
Act as SME for onboarding, troubleshooting, and best practices for GenAI/LLM usage and safe model operations.
Production Process Control and Documentation
Map and document production processes, data flows, deployment pipelines, and operational dependencies.
Create runbooks, SOPs, and playbooks for routine operations, change management, and emergency procedures.
Establish governance for change approvals, configuration management, and access controls.
Testing and Release Support
Contribute to pre‑prod testing: functional, integration, performance, load, and model validation tests.
Coordinate release readiness with QA, DevOps, and engineering; validate CI/CD pipelines and rollback mechanisms.
Support canary and staged rollouts, monitor metrics during releases, and authorize promotion to production.
Cross‑Functional Collaboration and Vendor Management
Work closely with Dev, SRE, Security, QA, and Product to prioritize reliability work and roadmap items.
Coordinate with cloud providers and third‑party vendors for escalations, upgrades, and capacity planning.
Communicate status and risks to leadership and stakeholders with clear, actionable reports.
Required Technical Skills
Programming and Scripting:
Python or Node.js for automation, monitoring scripts, and tooling.
Monitoring and Observability:
Hands‑on with Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, Prometheus, or equivalent.
Cloud Platforms:
Experience with Azure (preferred) or AWS/GCP; infrastructure provisioning and cost optimization.
Containers and Orchestration:
Docker and Kubernetes (AKS/EKS/GKE) operational experience.
CI/CD and DevOps:
Git, Jenkins/GitHub Actions/GitLab CI, pipeline troubleshooting and release…