MLOps Engineer Job, Indianapolis, Indiana, USA, IT/Tech

Location: Indianapolis

Job Description

Insight Global is seeking a Machine Learning Reliability Engineer for a large enterprise client modernizing and scaling its ML/AI platform. This role focuses on ensuring ML systems are reliable, observable, and cost‑efficient. The engineer will define SLOs, build robust Datadog monitoring, standardize incident response, and partner closely with FinOps and governance teams. This is a highly visible role critical to production ML success, ideal for an SRE who understands ML workloads and wants to own reliability, observability, and operational excellence across enterprise AI systems.

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.

If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:

Skills and Requirements

Strong background in Site Reliability Engineering (SRE) principles
Hands‑on Datadog experience (dashboards, metrics, logs, traces, alerting)
Experience supporting ML/AI systems in production
Ability to define and enforce SLOs / SLIs for distributed systems
Monitoring of availability, latency, accuracy, drift, and pipeline health
Experience operating in cloud environments (Azure strongly preferred)
Proven skills in performance tuning and cost optimization
Incident response ownership (alerts, runbooks, escalation paths)
ML‑specific observability (model performance, drift, LLM monitoring)
AI / LLM observability experience
Snowflake and modern data platform monitoring
FinOps partnership experience
ServiceNow integration (incident & change management)
Enterprise audit, governance, and compliance exposure
