AI/ML/DevOps Engineer
Job Description & How to Apply Below
What you'll own
- End‑to‑end MLOps/LLMOps reference architecture: ingestion, validation, feature & embedding pipelines, training & fine‑tuning, evaluation, registry, deployment & monitoring (including RAG and agentic workflows)
- Architect, review and approve CI/CD for ML and LLM systems covering code, data, prompt and model artifact versioning; build and release pipelines in Azure DevOps and GitHub Actions; automated unit, integration and contract testing; and promotion/rollback with blue‑green and canary deployments across dev, test and production
- Define and govern AI platform foundations on Azure using IaC (Bicep, Terraform); set up AML workspaces, AKS GPU node pools, private networking (VNet integration, Private Link), identity (Managed Identities, PIM), secrets (Key Vault), encryption and data residency controls
- Review and approve production deployment patterns for model and LLM serving (AKS, KServe, AML online endpoints); oversee containerization, inference optimisation (batching, quantisation), API management, autoscaling, resiliency and RAG runtime components (vector store, retriever, re‑ranker, cache)
- Own observability and reliability for AI services: OpenTelemetry tracing, prompt and inference logs with PII controls, latency, throughput, cost metrics, SLOs/SLIs, model performance monitoring, data and model drift detection, LLM evaluation for quality, hallucination checks, toxicity and safety guardrails, incident playbooks
- Establish and enforce MLOps/LLMOps governance: dataset lineage, data quality validation, schema and tests, feature store and model registry standards, artifact provenance, SBOM/SLSA, vulnerability scanning, approval gates for model and prompt releases, compliance‑aligned documentation for model risk, intended use limitations, evaluation results
- Enable delivery squads, including the primary delivery partner, with golden path templates (AML pipelines, RAG blueprints, evaluation harnesses, reusable IaC modules, coding standards); run deep technical design and architecture reviews; sign off on production readiness, capacity, security, observability, DR for all AI releases
- Support the Run & Operate model: issue triage and minor enhancement workflows, ticket intake, fix, controlled release; ensure changes follow release governance and quality gates
- Own the Operational Acceptance Gate: no production release without runbooks, monitoring dashboards, incident playbooks, access to model and DR test evidence; provide platform standards, review and sign‑off without replacing the delivery partner’s engineering while enforcing the golden path and production readiness bar
- 8–10 years across DevOps, SRE and/or ML Engineering with production systems on Azure
- Hands‑on experience with Azure ML, AKS, Azure DevOps or GitHub Actions, IaC and containerisation
- Bachelor's in Computer Science, Engineering or equivalent experience
- Python, YAML, Docker, Helm, KQL, GitOps (Argo/Flux) awareness
- Security in CI/CD, SAST, DAST, supply‑chain security (Sigstore), secrets management, Key Vault
- Performance testing (k6, JMeter), contract testing and end‑to‑end testing
- Cost optimisation and capacity planning for GPU and CPU workloads
- Strong grasp of model serving, inference optimisation and observability tooling
- Microsoft Certified: DevOps Engineer Expert (AZ‑400)
- Microsoft Certified: Azure Administrator (AZ‑104) or Solutions Architect (AZ‑305)
- CKA or CKAD (Kubernetes)