AI Integration Engineer Job, McLean area, Virginia, USA, IT/Tech

Job Number: R0238798

AI Integration Engineer

Overview

We are seeking a highly motivated AI Integration Engineer to design, deploy, and maintain the infrastructure that supports artificial intelligence systems, including large language models (LLMs) and distributed AI workloads. This role bridges advanced AI models, compute infrastructure, and operational workflows. You will manage AI readiness by architecting scalable infrastructure solutions, integrating complex systems, and maintaining operational excellence to ensure stable deployments of AI and machine learning applications.

The ideal candidate has a strong background in high‑performance computing, cloud infrastructure, MLOps or DevOps, and AI ecosystem integration.

Responsibilities

Serve as the technical point of contact for integrating LLMs and other AI workloads across infrastructure systems, operational tools, and application pipelines.
Architect, deploy, and maintain scalable GPU computing environments and the infrastructure required for autonomous agentic workflows, including persistent state management, long‑term memory systems such as vector databases, and multi‑step reasoning traces.
Develop, manage, and optimize CI/CD pipelines for AI deployments, ensuring smooth transitions from model development to production environments.
Oversee network and infrastructure connectivity, ensuring seamless communication between distributed systems, GPUs, virtual machines (VMs), APIs, and Command and Control (C2) tools.
Design and secure tool‑calling environments where agents interact with external APIs, ensuring strict governance and sandboxing for autonomous actions.
Provide diagnostic and troubleshooting expertise for AI systems, monitoring infrastructure to maintain availability, security, and performance benchmarks.
Collaborate across engineering, data, and AI teams to align infrastructure solutions with business and operational goals.

Qualifications

5+ years of experience in infrastructure engineering or system integration roles.
2+ years of experience supporting large‑scale AI/ML systems or GPU‑centric environments.
Experience with cloud platforms such as AWS, Azure, or Google Cloud and their AI‑focused services, including SageMaker, GCP AI Platform, and Azure Machine Learning.
Experience with networking concepts, including TCP/IP, DNS, NGINX, load balancing, and firewalls, applied to AI model and infrastructure deployment.
Experience integrating MLOps pipelines using tools such as MLflow, Kubeflow, TensorFlow Serving, or Vertex AI, including AgentOps frameworks such as LangSmith and Arize Phoenix to monitor autonomous decision‑making paths and agent reasoning traces.
Experience with orchestration frameworks for multi‑agent systems such as LangGraph, CrewAI, or AutoGen, and managing the stateful databases required to support them, including Redis and Postgres.
Experience working with NVIDIA GPU technologies, including CUDA, NCCL, TensorRT, and DGX systems, and container or orchestration tools such as Kubernetes, Docker, Terraform, or Pulumi.
Ability to manage and optimize distributed high‑performance computing environments, including clusters of GPUs and cloud‑based GPU instances.
TS/SCI clearance with a polygraph.
Bachelor's degree in CS, Computer Engineering, or Systems Engineering.

Nice to Have

Experience with AI/ML frameworks for model training and deployment such as PyTorch, TensorFlow, or Hugging Face Transformers.
Experience implementing observability and monitoring systems such as Grafana, Prometheus, and ELK for AI infrastructure to track performance and operational health.
Experience with security practices for AI systems, including encryption, role‑based access controls, secure APIs, and compliance frameworks such as SOC 2 and GDPR.
Experience with Agentic Safety, including implementation of Human‑in‑the‑Loop (HITL) approval gateways and automated kill switches for autonomous processes.
Experience with Vector Database infrastructure such as Pinecone, Weaviate, or Milvus, and Retrieval‑Augmented Generation (RAG) pipelines used to provide agents with contextual memory.
Knowledge of distributed computing frameworks such as Ray, Horovod, or Dask for AI…