Principal AI/ML Operations Engineer
Listed on 2025-12-03
IT/Tech
AI Engineer, Systems Engineer
Get to Know Us:
It's fun to work in a company where people truly believe in what they're doing!
At Black Line, we're committed to bringing passion and customer focus to the business of enterprise applications.
Since being founded in 2001, Black Line has become a leading provider of cloud software that automates and controls the entire financial close process. Our vision is to modernize the finance and accounting function to enable greater operational effectiveness and agility, and we are committed to delivering innovative solutions and services to empower accounting and finance leaders around the world to achieve Modern Finance.
As a best-in-class SaaS company, we understand that bringing in new ideas and innovative technology is mission critical. At Black Line we are always working with new, cutting-edge technology that encourages our teams to learn something new and to expand the creativity and technical skill sets that will accelerate their careers.
Work, Play and Grow at Black Line!
Make Your Mark:
The Principal AI/ML Operations Engineer leads the architecture, automation, and operationalization of both machine learning and AI systems. This role defines the strategy and technical standards for ML‑Ops and AIOps across the organization, ensuring models and agents are evaluated, deployed, governed, and monitored with reliability, efficiency, and compliance. The candidate will collaborate across AI, data, and product engineering teams to drive best practices for serving, observability, automated retraining, evaluation flywheels, and operational guardrails for AI systems in production.
You’ll Get To:
Leadership and Strategy
- Define enterprise‑level standards and reference architectures for ML‑Ops and AIOps systems.
- Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs).
- Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments.
- Lead incident response and reliability strategies for ML/AI systems.
- Lead the deployment of AI models and systems in various environments.
- Collaborate with development teams to integrate AI solutions into existing workflows and applications.
- Ensure seamless integration with different platforms and technologies.
- Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance.
- Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation workflows.
- Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics.
- Implement logging, metering, and auditing for agent behavior, function calls, and compliance alignment.
- Create scalable observability systems—tracking conversation outcomes, factual accuracy, latency, escalation patterns, and safety events.
- Architect end‑to‑end guardrails for AI agents including prompt injection protection, identity‑aware routing, and tool usage authorization.
- Collaborate cross‑functionally to standardize authentication, authorization, and session governance for multi‑agent runtimes.
- Architect and standardize model registries and feature stores to support version tracking, lineage, and reproducibility across environments.
- Lead the deployment of machine learning models into production environments, ensuring scalability, reliability, and efficiency.
- Collaborate with software engineers to integrate machine learning models into existing applications and systems.
- Implement and maintain APIs for model inference.
- Design and manage training infrastructure including distributed training orchestration, GPU/TPU resource allocation, and automatic scaling.
- Implement CI/CD for model workflows using pipelines integrated with model validation, bias checks, and rollback automation.
- Build standardized experimentation frameworks for reproducible training, tuning, and deployment cycles (MLflow, W&B, Kubeflow).
- Manage and optimize the infrastructure required for machine learning operations in the cloud.
- Work…