Senior Machine Learning Engineer
Listed on 2026-01-04
IT/Tech
AI Engineer, Machine Learning/ML Engineer
About the Job
Are you passionate about shaping the future of AI by building infrastructure that ensures large language models and AI agents are safe, reliable, and aligned with human values? Red Hat's OpenShift AI team is seeking a principal ML engineer who combines deep technical expertise with a commitment to responsible AI innovation.
As a pivotal contributor to open-source projects like Open Data Hub, KServe, TrustyAI, Kubeflow, and llama-stack, you'll be at the forefront of democratizing trustworthy AI infrastructure. These critical open-source initiatives are transforming how organizations develop, deploy, and monitor machine learning models across hybrid cloud and edge environments. Your work will directly shape the next generation of MLOps platforms, making advanced AI technologies more accessible, secure, and ethically aligned.
In today's rapidly evolving technological landscape, AI is becoming an integral part of our lives, powering everything from daily apps to complex systems in healthcare, finance, and beyond. While this is exciting, the focus on "what AI can do" has overshadowed "how it can do it safely".
Our team's mission is to create reliable AI systems that humans can trust. We do this by making AI safety both practical and accessible: practical in the sense that you can implement it today, with complexity reduced so that developers and organizations can adopt it in real-world environments, and accessible in the sense that our tools are open source and free from vendor lock-in.
What you will do
Architect and lead development of large-scale evaluation platforms for LLMs and agents, enabling automated, reproducible, and extensible assessment of accuracy, reliability, safety, and performance across diverse domains.
Define organizational standards and metrics for LLM/agent evaluation, covering hallucination detection, factuality, bias, robustness, interpretability, and alignment drift.
Build platform components and APIs that allow product teams to integrate evaluation seamlessly into training, fine-tuning, deployment, and continuous monitoring workflows (a minimal sketch of such an API follows this list).
Design automated pipelines and benchmarks for adversarial testing, red‑teaming, and stress testing of LLMs and retrieval‑augmented generation (RAG) systems.
Lead initiatives in multi‑dimensional evaluation, including safety (toxicity, bias, harmful outputs), grounding (retrieval correctness, source attribution), and agent behaviors (tool use, planning, trustworthiness).
Collaborate with cross‑functional stakeholders (safety, product, research, infrastructure) to translate abstract evaluation goals into measurable, system‑level frameworks.
Advance interpretability and observability, developing tools that allow teams to understand, debug, and explain LLM behaviors in production.
Mentor engineers and establish best practices, driving adoption of evaluation‑driven development across the organization.
Influence technical roadmaps and industry direction, representing the team’s evaluation‑first approach in external forums and publications.
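To give a concrete flavor of the evaluation-API work mentioned above, here is a minimal sketch assuming a plain metric-registry design. Every name in it (Evaluator, EvalResult, exact_match) is hypothetical and not drawn from TrustyAI, Kubeflow, or any other project listed in this posting.

```python
# Hypothetical sketch of the kind of evaluation API described above.
# Names (Evaluator, EvalResult, exact_match) are illustrative only and
# do not come from TrustyAI, KServe, or any other project named here.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class EvalResult:
    metric: str
    score: float  # mean score over all test cases
    details: Dict[str, float] = field(default_factory=dict)

# A metric is any callable mapping (prompt, output, reference) to a score.
Metric = Callable[[str, str, str], float]

@dataclass
class Evaluator:
    metrics: Dict[str, Metric]

    def run_suite(self, cases: List[Dict[str, str]]) -> List[EvalResult]:
        """Score every test case with every registered metric."""
        results = []
        for name, fn in self.metrics.items():
            scores = [fn(c["prompt"], c["output"], c["reference"]) for c in cases]
            results.append(EvalResult(metric=name, score=sum(scores) / len(scores)))
        return results

def exact_match(prompt: str, output: str, reference: str) -> float:
    # Toy factuality proxy: 1.0 if the reference answer appears in the output.
    return 1.0 if reference.lower() in output.lower() else 0.0

if __name__ == "__main__":
    evaluator = Evaluator(metrics={"exact_match": exact_match})
    cases = [
        {"prompt": "Capital of France?", "output": "Paris.", "reference": "Paris"},
        {"prompt": "2 + 2?", "output": "5", "reference": "4"},
    ]
    for r in evaluator.run_suite(cases):
        print(f"{r.metric}: {r.score:.2f}")  # exact_match: 0.50
```

A production platform would layer async execution, model adapters, and result persistence on top of a core like this; the point of the sketch is only the shape of the integration surface product teams would call.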
What you will bring
5+ years of ML engineering experience, with 3+ years focused on large‑scale evaluation of transformer‑based LLMs and/or agentic systems.
Proven experience building evaluation platforms or frameworks that operate across training, deployment, and post‑deployment contexts.
Deep expertise in designing and implementing LLM evaluation metrics (factuality, hallucination detection, grounding, toxicity, robustness); a toy grounding metric is sketched after this list.
Strong background in scalable platform engineering, including APIs, pipelines, and integrations used by multiple product teams.
Demonstrated ability to bridge research and engineering, operationalizing safety and alignment techniques into production evaluation systems.
Proficiency in Python, PyTorch, Hugging Face, and modern ML ops/deployment environments.
Track record of technical leadership, including mentoring, architecture design, and defining org‑wide practices.
Experience with multi‑agent evaluation frameworks and graph‑based metrics for agent interactions.
Background in retrieval‑augmented…
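To make one of these metrics concrete, the toy sketch below scores "grounding" as simple token overlap between a generated answer and its retrieved context. This is a hypothetical illustration only (the function name grounding_score is invented for this sketch); production evaluation typically relies on NLI-based entailment or claim-level verification instead.

```python
# Hypothetical toy grounding metric: the fraction of answer tokens that
# also appear in the retrieved context. Not a metric from any project
# named above; real systems favor NLI entailment or claim-level checks.
import re

def grounding_score(answer: str, context: str) -> float:
    """Return the fraction of answer tokens present in the context."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(context)) / len(answer_tokens)

print(grounding_score(
    "KServe serves models on Kubernetes.",
    "KServe is a model-serving platform built on Kubernetes.",
))  # prints 0.6 — three of the five answer tokens appear in the context
```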