Machine Learning (ML) Operations Engineer Job, Bradenton area, Florida, USA, IT/Tech

Position: Staff Machine Learning (ML) Operations Engineer

Overview

CoAdvantage is an HCM company providing payroll, ASO, and PEO services to 16,000 clients. We deliver payroll, benefits, HR compliance, time/PTO, and risk management solutions, and we are building a governed AI platform that will become a primary source of differentiation versus AI-native competitors. The AI program runs three substrates (engineering knowledge graph, analytics feature store, customer knowledge store) and a multi-agent harness.

The Principal AI Architect designs the platform. The Staff MLOps Engineer makes it operationally real, repeatable, and safe to deploy.

What You'll Own

You are the operational backbone of the AI platform

Build and own the deployment pipelines for models, agents, prompts, evals, feature definitions, and KG/vector indices. Everything that touches production goes through a pipeline you wrote.
Operate the feature store (offline + online), the knowledge graph infra (ADO-KG and Customer Graph), and the vector indexing layer: ingestion, materialization, freshness, drift, lineage.
Stand up the eval harness as CI: every agent, prompt, and model change runs its eval suite on PR; a regression that breaks an eval blocks merge.
Wire the observability plane: traces for every agent step, prompts and tool calls captured with PII redaction, cost and latency SLOs per surface, drift monitors, on-call runbooks.
Operate the HITL queue infrastructure: routing, SLAs, audit, and the feedback loop back into evals and the KG.
Own incident response for AI surfaces: cross-tenant leakage, prompt injection, agent loop runaway, capability drift, KG poisoning. You write the runbooks and you carry the pager.
Manage cost, capacity, and model routing across LM tiers (frontier vs. cheap-and-fast): agents should land on the right tier automatically, with budgets and circuit breakers.
Own secrets, identity, and AuthZ enforcement at the infra layer: tenant scoping must be enforced independently of the LLM, every time.
You will write a lot of code. You will not be a "platform PM".
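To make the model-routing item above concrete, here is a minimal Python sketch of tier routing with a budget and a circuit breaker. It is an illustration only, not our implementation: the TierRouter class, the tier names, and the per-call costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TierRouter:
    """Hypothetical router: picks an LM tier and stops calls once the budget is spent."""

    budget_usd: float
    frontier_cost_usd: float = 0.05  # assumed per-call cost, illustrative only
    cheap_cost_usd: float = 0.001    # assumed per-call cost, illustrative only
    spent_usd: float = 0.0

    def route(self, needs_frontier: bool) -> str:
        """Return the tier for one call, or raise if no tier fits the remaining budget."""
        tier = "frontier" if needs_frontier else "cheap"
        cost = self.frontier_cost_usd if needs_frontier else self.cheap_cost_usd
        # Downgrade to the cheap tier when a frontier call would exceed the budget.
        if needs_frontier and self.spent_usd + cost > self.budget_usd:
            tier, cost = "cheap", self.cheap_cost_usd
        # Circuit breaker: refuse the call when even the cheap tier would overspend.
        if self.spent_usd + cost > self.budget_usd:
            raise RuntimeError("budget exhausted: circuit open")
        self.spent_usd += cost
        return tier
```

In this sketch the downgrade is silent; a production router would presumably also emit a trace event so the downgrade shows up in the observability plane.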

How We Work

AI-first coding. Claude Code, Copilot, and successor tools are the default development surface. We expect you to author pipelines, IaC, runbooks, eval harnesses, and operators with agentic coding tools in the loop.
Build your own agentic workflows. Repetitive ops work (incident triage, drift investigation, eval failure root-cause, capacity forecasting) gets automated as an agentic workflow you author and own.
Every workflow is testable. Every pipeline, every agentic ops workflow, every runbook has tests: unit, integration, eval-on-PR, replay against a golden incident set.
Ambiguity is the job. Specs from the Architect will be 80% complete on purpose. You fill the last 20% by shipping, instrumenting, and reporting back what the operational reality is.
You estimate. Every workstream returns with a timeline, a confidence interval, an explicit list of dependencies, and the smallest version you could ship in a week.
You suggest the tools. Specific opinions on orchestration (Dagster vs. Airflow vs. Prefect), serving, tracing, feature store, registry, and vector indexer, and the willingness to defend them.

First 90 Days

Ship the deployment pipeline for one agent end-to-end: code → eval-on-PR → staged rollout → traced production → drift monitor → rollback. Used by the first production agent.
Stand up eval-as-CI: PRs to any agent, prompt, or model run their suite automatically; failures block merge; results posted to the PR.
Bring up an online feature store for the MLR Pricing / Premium Tiering model with a freshness watchdog and a fail-closed posture on stale features.
Define and implement the cross-tenant leakage probe as a continuous CI check against the Customer Knowledge Store retrieval layer.
Publish the incident runbook set for the four catastrophic-tier risks (cross-tenant leakage, prompt injection, KG poisoning, agent loop runaway) and rehearse one of them with the team.
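As one illustration of the "fail-closed posture on stale features" item above, here is a minimal Python sketch of a freshness check that refuses to serve a feature value older than an allowed age. The function name, exception type, and parameters are hypothetical and do not describe any particular feature store's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class StaleFeatureError(Exception):
    """Raised instead of serving a feature value that is past its freshness limit."""


def read_feature_fail_closed(
    value: float,
    updated_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> float:
    """Return the value only if it is fresh; otherwise raise (fail closed)."""
    # Allow the caller to inject "now" so the check can be tested deterministically.
    current = now if now is not None else datetime.now(timezone.utc)
    age = current - updated_at
    if age > max_age:
        # Failing closed: the caller must handle the error rather than score on stale data.
        raise StaleFeatureError(f"feature is {age} old, limit is {max_age}")
    return value
```

The alternative, failing open by serving the last known value, keeps the model available but lets it price on outdated inputs; which posture is right depends on the surface.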

Required Skills & Experience

5+ years of production software / platform engineering; at least 3 years operating ML or LLM systems at meaningful scale.
Strong Python + at least one IaC stack (Terraform, Pulumi, Bicep). Comfortable in containers, Kubernetes, and…