Computer Scientist Job, Noida area, Uttar Pradesh, India, IT/Tech

Location: Noida

Experience: 6–9 Years

Team: AI Platform Engineering

Role Overview

We are looking for an experienced Infrastructure Developer (6–9 years) to help design, build, and scale the platform that powers our most demanding ML training workloads. This is a hands-on engineering role where you will write production-grade code, drive meaningful technical initiatives, and contribute to the reliability of infrastructure that supports thousands of GPU hours of training every day.

You bring strong Kubernetes skills, solid networking fundamentals, a developer's mindset, and the ability to own projects end-to-end with limited supervision. You have operated systems at significant scale and are ready to step up into broader technical leadership.

About The Platform

You will be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to large distributed training jobs running on GPU clusters with thousands of accelerators. It provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads — whether a lightweight prototype or a production-grade deep learning model powering real-world applications.

Key platform capabilities:

Dynamic GPU orchestration: Kubernetes with custom schedulers and resource topology awareness.
Training & inference workflows: end-to-end pipeline support from data ingestion through model serving.
Observability & cost tracking: full-stack visibility across compute, network, and storage layers.
Self-service developer tooling: enabling high-velocity experimentation without platform bottlenecks.
Multi-cloud infrastructure: primarily AWS, with Azure/GCP expansion underway.

Your contributions will directly influence the reliability, scalability, and efficiency of this platform — and the speed at which AI teams can innovate.

What You'll Do

Build for scale: Design and improve Kubernetes-native infrastructure that runs distributed GPU training jobs reliably and efficiently. You will own significant components and drive their evolution.
Lead focused initiatives: Own meaningful projects end-to-end: write design docs, gather input from stakeholders, and deliver under realistic timelines, often collaborating with engineers across time zones.
Codify infrastructure: Define and ship cloud infrastructure through IaC (Terraform/Pulumi). Apply the same rigor, testing, and review discipline to infra changes as to application code.
Strengthen observability: Contribute to and extend deep observability stacks (metrics, distributed tracing, log aggregation, SLO/SLI frameworks) that surface problems before they become incidents.
Write production code: Build automation, internal tooling, operators, and platform services in Go, Python, or Rust. This is not a YAML-only role.
Own reliability: Participate in incident response, post-mortems, and reliability reviews. Drive systemic fixes, not just workarounds. Be a strong contributor to on-call culture.
Solve hard networking problems: Debug and resolve complex cluster networking issues, including CNI, BGP, service mesh, DNS at scale, east-west traffic, and throughput tuning.
Mentor and grow: Raise the technical bar through code reviews, design feedback, and knowledge sharing with peers and more junior engineers.

What You Bring

Core Requirements

Kubernetes & GPU Infrastructure

6–9 years in SRE, platform engineering, or infrastructure roles
Strong working knowledge of Kubernetes internals: scheduler, kubelet, CRDs, operators, admission controllers
Hands-on experience running GPU/accelerator training workloads in production
Familiarity with multi-cluster management and workload placement strategies
Helm, Kustomize, GitOps (Flux/ArgoCD): practical experience and good judgment on when to use them

Cloud & Infrastructure as Code

Solid hands-on AWS experience (VPC, EKS, EC2, S3, IAM; TGW a plus)
Production experience with Terraform or Pulumi — modular and tested
CI/CD for infrastructure: drift detection, plan gating, rollback strategies
Working understanding of cost optimization, reserved capacity, and spot instance management

Observability

Prometheus, Grafana, Alertmanager —…