Noida
Experience: 6–9 Years
Team: AI Platform Engineering
Role Overview
We are looking for an experienced Infrastructure Developer (6–9 years) to help design, build, and scale the platform that powers our most demanding ML training workloads. This is a hands-on engineering role where you will write production-grade code, drive meaningful technical initiatives, and contribute to the reliability of an infrastructure that thousands of GPU hours depend on every day.
You bring strong Kubernetes skills, solid networking fundamentals, a developer's mindset, and the ability to own projects end-to-end with limited supervision. You have operated systems at significant scale and are ready to step up into broader technical leadership.
About The Platform
You will be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to large distributed training jobs running on GPU clusters with thousands of accelerators. It provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads — whether a lightweight prototype or a production-grade deep learning model powering real-world applications.
Key platform capabilities:
Dynamic GPU orchestration: Kubernetes with custom schedulers and resource topology awareness.
Training & inference workflows: end-to-end pipeline support from data ingestion through model serving.
Observability & cost tracking: full-stack visibility across compute, network, and storage layers.
Self-service developer tooling: high-velocity experimentation without platform bottlenecks.
Multi-cloud infrastructure: primarily AWS, with Azure/GCP expansion underway.
Your contributions will directly influence the reliability, scalability, and efficiency of this platform — and the speed at which AI teams can innovate.
What You'll Do
Build for scale: Design and improve Kubernetes-native infrastructure that runs distributed GPU training jobs reliably and efficiently. You will own significant components and drive their evolution.
Lead focused initiatives: Own meaningful projects end-to-end. Write design docs, gather input from stakeholders, and deliver under realistic timelines, often collaborating with engineers across time zones.
Codify infrastructure: Define and ship cloud infrastructure through IaC (Terraform/Pulumi). Apply the same rigor, testing, and review discipline to infra changes as to application code.
Strengthen observability: Contribute to and extend deep observability stacks (metrics, distributed tracing, log aggregation, SLO/SLI frameworks) that surface problems before they become incidents.
Write production code: Build automation, internal tooling, operators, and platform services in Go, Python, or Rust. This is not a YAML-only role.
Own reliability: Participate in incident response, post-mortems, and reliability reviews. Drive systemic fixes, not just workarounds. Be a strong contributor to on-call culture.
Solve hard networking problems: Debug and resolve complex cluster networking issues, including CNI, BGP, service mesh, DNS at scale, east-west traffic, and throughput tuning.
Mentor and grow: Raise the technical bar through code reviews, design feedback, and knowledge sharing with peers and more junior engineers.
What You Bring
Core Requirements
Kubernetes & GPU Infrastructure
6–9 years in SRE, platform engineering, or infrastructure roles
Strong working knowledge of Kubernetes internals: scheduler, kubelet, CRDs, operators, admission controllers
Hands-on experience running GPU/accelerator training workloads in production
Familiarity with multi-cluster management and workload placement strategies
Helm, Kustomize, GitOps (Flux/ArgoCD): practical experience and good judgment on when to use them
Cloud & Infrastructure as Code
Solid hands-on AWS experience (VPC, EKS, EC2, S3, IAM; TGW a plus)
Production experience with Terraform or Pulumi — modular and tested
CI/CD for infrastructure: drift detection, plan gating, rollback strategies
Working understanding of cost optimization, reserved capacity, and spot instance management
Observability
Prometheus, Grafana, Alertmanager…