MLOps Engineer
Listed on 2026-04-29
IT/Tech
Cloud Computing, SRE/Site Reliability
Job Description
Insight Global is seeking a Machine Learning Reliability Engineer for a large enterprise client modernizing and scaling its ML/AI platform. This role focuses on ensuring ML systems are reliable, observable, and cost‑efficient. The engineer will define SLOs, build robust Datadog monitoring, standardize incident response, and partner closely with FinOps and governance teams. This is a highly visible role critical to production ML success, ideal for an SRE who understands ML workloads and wants to own reliability, observability, and operational excellence across enterprise AI systems.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.
If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:
- Strong background in Site Reliability Engineering (SRE) principles
- Hands‑on Datadog experience (dashboards, metrics, logs, traces, alerting)
- Experience supporting ML/AI systems in production
- Ability to define and enforce SLOs / SLIs for distributed systems
- Monitoring of availability, latency, accuracy, drift, and pipeline health
- Experience operating in cloud environments (Azure strongly preferred)
- Proven skills in performance tuning and cost optimization
- Incident response ownership (alerts, runbooks, escalation paths)
- ML‑specific observability (model performance, drift, LLM monitoring)
- AI / LLM observability experience
- Snowflake and modern data platform monitoring
- FinOps partnership experience
- ServiceNow integration (incident & change management)
- Enterprise audit, governance, and compliance exposure