Kubernetes AI Engineer
Listed on 2026-06-19
IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability
Job Description
Insight Global is seeking a Kubernetes AI Engineer for a leading enterprise client in the technology space. This role will focus on designing, building, and optimizing scalable AI platform infrastructure that supports end-to-end machine learning workflows, including model development, training, and inference. The ideal candidate brings deep Kubernetes and containerization expertise, along with a strong foundation in Linux systems and platform engineering.
They will play a key role in driving performance, security, automation, and observability across enterprise AI environments, while partnering closely with cross-functional teams to deliver resilient, compliant, and high-performing platform solutions.
- Design and engineer Kubernetes-based AI platform infrastructure supporting model development, training, and inference
- Build and manage containerized environments using Docker, Kubernetes, Operators, and Helm
- Optimize cluster performance, workload placement, scalability, and resource utilization
- Implement enterprise-grade security and compliance controls across Kubernetes environments
- Develop and maintain observability frameworks (monitoring, logging, alerting, dashboards)
- Automate deployment, configuration, patching, and platform operations using scripting and IaC
- Collaborate with AI/ML engineers, data scientists, DevOps, and infrastructure teams to deliver scalable solutions
- Translate business and technical requirements into platform architecture
- Create and maintain architecture documentation, runbooks, and operational standards
- Troubleshoot and resolve platform issues in complex enterprise environments
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.
If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:
- 5+ years of experience in Kubernetes, container platform engineering, Linux systems, or AI/ML infrastructure in enterprise environments
- Strong expertise in Kubernetes architecture, cluster deployment, orchestration, and lifecycle management
- Hands-on experience with Docker, container runtimes, Operators, and Helm charts
- Experience designing and engineering Kubernetes-based AI platform infrastructure for model development, training, and inference
- Strong Linux administration and troubleshooting skills across compute, networking, and storage layers
- Proficiency in Python, shell scripting, YAML, and JSON for automation and platform engineering
- Experience implementing security controls (RBAC, network policies, pod security, secrets management)
- Experience with observability and monitoring tools such as Prometheus, Grafana, and logging frameworks
- Experience optimizing clusters for performance, scalability, resiliency, and resource utilization
- Familiarity with CI/CD, GitOps, infrastructure automation, and environment standardization practices
- Experience supporting AI/ML platforms, model deployment pipelines, or GPU-enabled workloads
- Strong problem-solving and analytical skills in complex environments
- Clear written and verbal communication skills
- Strong collaboration skills across cross-functional teams (AI/ML, DevOps, Infrastructure)
- Strong documentation discipline and operational rigor
- Self-starter mindset with accountability and continuous improvement focus
- Kubernetes certifications (CKA, CKAD)
- Understanding of vulnerability management, enterprise controls, and remediation practices
- Experience building agentic AI solutions in enterprise environments
- Familiarity with Copilot Studio for app development
- Experience with AI/ML frameworks and model-serving platforms (PyTorch, TensorFlow, Triton, vLLM)
- Familiarity with GPU platforms and NVIDIA software stack
- Experience with OpenShift, hybrid cloud, or enterprise platform modernization initiatives