Senior ML Ops Engineer Job - Columbus area, Ohio, USA - IT/Tech

Senior ML Ops Engineer

Overview

As a Senior ML Ops Engineer at Mimecast, you will be a technical leader on the AI Enablement Platform (AIP) team, responsible for ensuring that machine learning models and AI agents are deployed, scaled, observed, and maintained reliably across production environments. The AI Enablement Platform serves billions of requests per month across multiple regions, powering AI-driven capabilities in email security, insider risk, data loss prevention, and collaboration security for Mimecast's Human Risk Management platform.

This role sits at the intersection of infrastructure engineering and machine learning. You will own the design and implementation of self-service deployment tooling, platform resilience and scaling infrastructure, and operational best practices that enable ML Engineers and Data Scientists to ship models and agents independently, with confidence. You will also be responsible for building and maintaining the developer platform that accelerates the work of ML practitioners across the organization.

This is a senior individual contributor role. You are expected to drive architectural decisions, mentor other engineers, define standards, and operate with a high degree of autonomy. You will collaborate closely with ML Engineers, Software Engineers, SRE, and Cloud Platform teams.

AI-First Engineering at Mimecast

Mimecast is an AI-First engineering organization. Our teams actively leverage AI-powered development tools across all facets of engineering, from code development to testing, documentation, and operations. We're looking for leaders who don't just use AI tools but champion their adoption and establish new ways of working.

Our AI leadership extends beyond how we build to what we build. Our Mihra AI agent delivers 7x faster threat response for customers, and we're recognized as "Agents of Change" in Human Risk Management. Engineers here work at the intersection of cutting-edge AI tooling and AI-powered security products that protect organizations worldwide.

What You'll Do:

* Self-Service Deployment Tooling:
Design and build config-driven, validated workflows that enable ML Engineers to deploy models to AIP infrastructure without requiring hands-on ML Ops involvement for each release. This includes automated validation pipelines, standardized configuration schemas, endpoint provisioning, and de-risked rollout patterns (canary, blue-green, rollback).

* Platform Resilience and Scaling:
Own the reliability and scalability of ML inference infrastructure. Design and tune autoscaling policies against real production traffic patterns, implement rate limiting and backpressure mechanisms (HTTP 429 responses with a Retry-After header) at the API layer, and build request prioritization frameworks (real-time vs. batch) so the platform protects itself under load without manual intervention or consumer-side changes.

* Observability and Monitoring:
Develop and maintain the platform's observability stack (metrics, logging, tracing, alerting) so that monitoring is wired in by default for every deployed model and agent. Continuously monitor model performance, data drift, latency, error rates, and system health. Build dashboards and alerting that give both the AIP team and consuming teams visibility into their workloads. All team members, including leadership, participate in an on-call rotation (12-hour shifts).

* CI/CD and Automation:
Design, implement, and maintain robust CI/CD pipelines for ML model and infrastructure deployments. Automate testing (functional, integration, performance) as pre-deployment gates that ML Engineers can trigger themselves, with clear pass/fail criteria.

* Infrastructure as Code:
Manage all AIP infrastructure through Terraform and configuration management tooling. Maintain multi-region deployment capabilities and ensure infrastructure changes are reviewable, repeatable, and auditable.

* Cost Optimization:
Implement and enforce cost tagging and allocation at deployment time. Optimize ML inference endpoints for cost-effectiveness, including right-sizing instance types, managing reserved capacity, and providing opinionated endpoint configuration recommendations based on model characteristics.

* Agent and LLM Operations:
Support the deployment and operational management of AI agents and LLM-based capabilities within the AIP's templatized agent framework. This includes infrastructure for agent hosting, tool access configuration, and observability for agentic workloads.

* Security and Compliance:
Ensure ML systems adhere to security best practices, including input validation, authentication, network exposure controls, and automated security scanning for model configurations. Support compliance with regulatory requirements relevant to AI systems in the cybersecurity domain.

* Technical Leadership:
Mentor ML Ops and ML Engineers on operational best practices. Participate in architectural reviews, contribute to platform governance, and drive engineering standards through documentation, code reviews, and…