Machine Learning Operations (MLOps) Architect - Generative AI Focus

Job in Arlington, Arlington County, Virginia, 22201, USA
Listing for: Kapitus
Full Time position
Listed on 2026-05-09
Job specializations:
  • IT/Tech
    AI Engineer, Machine Learning/ML Engineer
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below
Position: Machine Learning Operations (MLOps) Architect - Generative AI Focus

What You’ll Do

  • MLOps & GenAI Platform Architecture
    • Design and implement scalable ML and LLM infrastructure on AWS (SageMaker, EKS, S3, IAM, Lambda, Step Functions, CloudWatch).
    • Architect end-to-end ML and Generative AI lifecycle workflows:
      • Data ingestion & preprocessing
      • Feature engineering / embedding generation
      • Model training & fine‑tuning (traditional ML + foundation models)
      • Model evaluation & validation
      • Deployment (real‑time, batch, streaming)
      • Monitoring & retraining
    • Integrate LLM pipelines (prompt workflows, RAG architectures, fine‑tuning flows) into the enterprise MLOps stack.
    • Define standards for CI/CD/CT pipelines across ML and GenAI workloads.
  • Generative AI & LLM Operationalization
    • Architect Retrieval‑Augmented Generation (RAG) pipelines including:
      • Embedding generation workflows
      • Vector database integration
      • Document ingestion and chunking strategies
      • Retrieval evaluation and monitoring
    • Design and deploy LLM‑based services using:
      • Managed services (e.g., SageMaker endpoints, Bedrock‑style APIs)
      • Containerized custom inference services
    • Establish prompt versioning, evaluation frameworks, and experiment tracking for LLM systems.
    • Implement guardrails for hallucination control, safety monitoring, bias detection, and usage logging.
    • Define architecture for LLM fine‑tuning workflows (including data curation, evaluation, and cost controls).
    • Implement scalable orchestration of LLM pipelines using workflow engines and event‑driven patterns.
  • Deployment, Monitoring & Reliability
    • Architect scalable inference patterns for:
      • Traditional ML models
      • LLM APIs
      • RAG systems
    • Implement model monitoring frameworks for:
      • Performance degradation
      • Drift detection
      • LLM output quality
      • Latency and token usage metrics
    • Define SLAs/SLOs for ML and GenAI systems.
    • Design safe deployment strategies (blue/green, canary, shadow testing).
    • Establish logging, observability, and traceability standards for GenAI systems.
  • FinOps & Cost Optimization
    • Implement cost tracking for:
      • Training workloads (GPU utilization)
      • Inference endpoints (token consumption for LLM APIs)
      • Vector database storage
    • Optimize LLM workloads for cost-performance tradeoffs (model size, batching, caching strategies).
    • Design autoscaling and compute optimization strategies for GPU and CPU‑based inference.
    • Partner with finance and engineering teams to forecast ML/GenAI infrastructure spend.
  • Platform Enablement & Standards
    • Define enterprise standards for:
      • Experiment tracking
      • Model registry
      • Prompt registry
      • Artifact management
      • Embedding versioning
    • Provide architectural guidance to data science, AI, and engineering teams.
    • Evaluate and recommend tooling across the ML/GenAI stack (MLflow, feature stores, vector databases, orchestration tools).
    • Drive documentation and reusable patterns for ML and GenAI development.
What We’re Looking For
  • 6+ years of experience in ML engineering, data engineering, or MLOps roles.
  • Proven experience architecting ML platforms in AWS.
  • Strong hands‑on experience with SageMaker (training, pipelines, deployment).
  • Experience operationalizing LLM or Generative AI systems in production.
  • Experience building RAG pipelines and integrating vector databases.
  • Experience working with Databricks in production.
  • Experience implementing data governance and catalog systems (e.g., Atlan).
  • Strong understanding of CI/CD principles for ML and GenAI.
  • Experience with containerization (Docker) and orchestration (Kubernetes/EKS).
  • Deep knowledge of infrastructure‑as‑code (Terraform, CloudFormation).
  • Strong understanding of observability and monitoring for ML systems.
  • Experience implementing cloud cost optimization strategies (FinOps).
  • Strong Python proficiency.
  • Experience with foundation model fine‑tuning and parameter‑efficient methods.
  • Experience implementing model registries and experiment tracking tools.
  • Experience designing feature stores and embedding stores.
  • Familiarity with AI risk management, bias mitigation, and safety controls.
  • Experience supporting regulated or data‑sensitive environments.
  • Platform‑level architectural thinking.
  • Deep understanding of how to integrate GenAI into enterprise ML ecosystems.
  • Ability to balance scalability, governance, security, performance, and cost.
  • Strong technical leadership and cross‑functional…