Machine Learning Operations (MLOps) Architect - Generative AI Focus
Job in Arlington, Arlington County, Virginia, 22201, USA
Listed on 2026-05-09
Listing for: Kapitus
Full Time position
Job specializations:
- IT/Tech: AI Engineer, Machine Learning/ML Engineer
Job Description & How to Apply Below
What You’ll Do
- MLOps & GenAI Platform Architecture
- Design and implement scalable ML and LLM infrastructure on AWS (SageMaker, EKS, S3, IAM, Lambda, Step Functions, CloudWatch).
- Architect end-to-end ML and Generative AI lifecycle workflows:
- Data ingestion & preprocessing
- Feature engineering / embedding generation
- Model training & fine‑tuning (traditional ML + foundation models)
- Model evaluation & validation
- Deployment (real‑time, batch, streaming)
- Monitoring & retraining
- Integrate LLM pipelines (prompt workflows, RAG architectures, fine‑tuning flows) into the enterprise MLOps stack.
- Define standards for CI/CD/CT pipelines across ML and GenAI workloads.
- Generative AI & LLM Operationalization
- Architect Retrieval‑Augmented Generation (RAG) pipelines including:
- Embedding generation workflows
- Vector database integration
- Document ingestion and chunking strategies
- Retrieval evaluation and monitoring
- Design and deploy LLM‑based services using:
- Managed services (e.g., SageMaker endpoints, Bedrock‑style APIs)
- Containerized custom inference services
- Establish prompt versioning, evaluation frameworks, and experiment tracking for LLM systems.
- Implement guardrails for hallucination control, safety monitoring, bias detection, and usage logging.
- Define architecture for LLM fine‑tuning workflows (including data curation, evaluation, and cost controls).
- Implement scalable orchestration of LLM pipelines using workflow engines and event‑driven patterns.
- Deployment, Monitoring & Reliability
- Architect scalable inference patterns for:
- Traditional ML models
- LLM APIs
- RAG systems
- Implement model monitoring frameworks for:
- Performance degradation
- Drift detection
- LLM output quality
- Latency and token usage metrics
- Define SLAs/SLOs for ML and GenAI systems.
- Design safe deployment strategies (blue/green, canary, shadow testing).
- Establish logging, observability, and traceability standards for GenAI systems.
- FinOps & Cost Optimization
- Implement cost tracking for:
- Training workloads (GPU utilization)
- Inference endpoints (token consumption for LLM APIs)
- Vector database storage
- Optimize LLM workloads for cost-performance tradeoffs (model size, batching, caching strategies).
- Design autoscaling and compute optimization strategies for GPU and CPU‑based inference.
- Partner with finance and engineering teams to forecast ML/GenAI infrastructure spend.
- Platform Enablement & Standards
- Define enterprise standards for:
- Experiment tracking
- Model registry
- Prompt registry
- Artifact management
- Embedding versioning
- Provide architectural guidance to data science, AI, and engineering teams.
- Evaluate and recommend tooling across the ML/GenAI stack (MLflow, feature stores, vector databases, orchestration tools).
- Drive documentation and reusable patterns for ML and GenAI development.
- 6+ years of experience in ML engineering, data engineering, or MLOps roles.
- Proven experience architecting ML platforms in AWS.
- Strong hands‑on experience with SageMaker (training, pipelines, deployment).
- Experience operationalizing LLM or Generative AI systems in production.
- Experience building RAG pipelines and integrating vector databases.
- Experience working with Databricks in production.
- Experience implementing data governance and catalog systems (e.g., Atlan).
- Strong understanding of CI/CD principles for ML and GenAI.
- Experience with containerization (Docker) and orchestration (Kubernetes/EKS).
- Deep knowledge of infrastructure‑as‑code (Terraform, CloudFormation).
- Strong understanding of observability and monitoring for ML systems.
- Experience implementing cloud cost optimization strategies (FinOps).
- Strong Python proficiency.
- Experience with foundation model fine‑tuning and parameter‑efficient methods.
- Experience implementing model registries and experiment tracking tools.
- Experience designing feature stores and embedding stores.
- Familiarity with AI risk management, bias mitigation, and safety controls.
- Experience supporting regulated or data‑sensitive environments.
- Platform‑level architectural thinking.
- Deep understanding of how to integrate GenAI into enterprise ML ecosystems.
- Ability to balance scalability, governance, security, performance, and cost.
- Strong technical leadership and cross‑functional…