AI Ops Engineer
Listed on 2026-01-01
IT/Tech
AI Engineer, Cloud Computing
Description
AI Ops Engineer (Exempt)
Enterprise AI/ML Organization
At this time, we are unable to offer visa sponsorship for this position. Candidates must be legally authorized to work for any employer in the United States (or applicable country) on a full-time basis without the need for current or future immigration sponsorship.
Overview
We are looking for an experienced AI Ops Engineer to support our AI and ML initiatives, including GenAI platform development, deployment automation, and infrastructure optimization. You will play a critical role in building and maintaining scalable, secure, and observable systems that power RAG solutions, model training platforms, and agentic AI workflows across the enterprise.
Responsibilities
- Design and implement CI/CD pipelines for AI and ML model training, evaluation, and RAG system deployment (including LLMs, vector DBs, embedding and reranking models, governance and observability systems, and guardrails).
- Provision and manage AI infrastructure across cloud hyperscalers (AWS/GCP) using infrastructure-as-code tools (strong preference for Terraform).
- Maintain containerized environments (Docker, Kubernetes) optimized for GPU workloads and distributed compute.
- Support vector database, feature store, and embedding store deployments (e.g., pgvector, Pinecone, Redis, Featureform, MongoDB Atlas, etc.).
- Monitor and optimize performance, availability, and cost of AI workloads, using observability tools (e.g., Prometheus, Grafana, Datadog, or managed cloud offerings).
- Collaborate with data scientists, AI/ML engineers, and other members of the platform team to ensure smooth transitions from experimentation to production.
- Implement security best practices including secrets management, model access control, data encryption, and audit logging for AI pipelines.
- Help support the deployment and orchestration of agentic AI systems (LangChain, LangGraph, CrewAI, Copilot Studio, Agentspace, etc.).
- 4+ years of DevOps, AI Ops, or infrastructure engineering experience, preferably with 2+ years in AI/ML environments.
- Hands-on experience with cloud-native services (AWS Bedrock/SageMaker, GCP Vertex AI, or Azure ML) and GPU infrastructure management.
- Strong skills in CI/CD tools (GitHub Actions, ArgoCD, Jenkins) and configuration management (Ansible, Helm, etc.).
- Proficient in scripting languages like Python and Bash; Go or similar is a nice plus.
- Experience with monitoring, logging, and alerting systems for AI/ML workloads.
- Deep understanding of Kubernetes and container lifecycle management.
- Exposure to AI Ops tooling such as MLflow, Kubeflow, SageMaker Pipelines, or Vertex Pipelines.
- Familiarity with prompt engineering, model fine‑tuning, and inference serving.
- Experience with secure AI deployment and compliance frameworks.
- Knowledge of model versioning, drift detection, and scalable rollback strategies.
- Ability to work with a high level of initiative, accuracy, and attention to detail.
- Ability to prioritize multiple assignments effectively. Ability to meet established deadlines.
- Ability to successfully, efficiently, and professionally interact with staff and customers.
- Excellent organization skills.
- Critical thinking ability for problems ranging from moderately to highly complex.
- Flexibility in meeting the business needs of the customer and the company.
- Ability to work creatively and independently with latitude and minimal supervision.
- Ability to utilize experience and judgment in accomplishing assigned goals.
- Experience in navigating organizational structure.
- Standing/ Walking — minimal level
- Sitting — moderate to high level
- Lifting — up to 15 lbs.
- Visual Concentration — high level
- Work Environment — typical office environment.
Travel Required:
2%
Position Type and Expected Hours of Work:
Full Time
The above statement is intended to describe the general nature and level of work being performed. It is not intended to be an exhaustive list of responsibilities, duties and skills required.