AI PRE Engineer
Location: Noida
Experience: 10-16 Years
AI PRE Engineer (Platform Reliability / Production Readiness Engineer)
The Role
An AI PRE Engineer ensures AI/ML platforms are production-ready, highly reliable, observable, secure, and cost-efficient, bridging AI engineering, SRE, DevOps, and MLOps disciplines.
Responsibilities:
Define and maintain production readiness standards across platform, data, model, application, and security layers.
Establish SLO/SLI frameworks for latency, availability, quality, safety, and drift; implement error budget policies.
Publish reference architectures for LLM apps, RAG, vector stores, agent frameworks, and batch/stream inference.
Curate deployment blueprints (canary/shadow, blue–green, A/B) for models and prompts with rollback guidance.
Standardize observability patterns for prompts, embeddings, latency, cost, quality, and safety telemetry.
Own capacity engineering (token/concurrency budgets, GPU/CPU sizing, vector scaling, cache hierarchies).
Define resilience patterns (timeouts, circuit breakers, fallbacks, idempotent retries, semantic/prompt caching).
Set AI security baselines (secrets, private networking, egress controls) and mandate red‑team & safety evaluations.
Maintain compliance mappings (e.g., ISO 27001, SOC 2, GDPR/DPDP, HIPAA where applicable).
Provide CI/CD pipelines, SDKs, Helm/Terraform templates, and policy‑as‑code for consistent delivery.
Author PRR checklists, runbooks/playbooks, and DR/BCP blueprints (RTO/RPO, multi‑region/site failover).
Drive enablement (trainings, brown-bags) and maintain knowledge repositories and decision records.
Partner with solution teams to validate architecture and non‑functional requirements (scale, latency, cost, safety).
Conduct Production Readiness Reviews (PRRs) and certify releases across performance, security, privacy, and compliance.
Implement observability (tracing, metrics, logs), dashboards, and alerting on SLO burn rate and cost anomalies.
Experience with development environments such as Jupyter Notebook, Visual Studio Code, and PyCharm.
Familiarity with AI-related libraries such as LangChain, PandasAI, and OpenAI.
Execute safe releases (canary/shadow/blue–green), prompt/model versioning, feature flags, and rollback plans.
Lead incident response for AI workloads; perform post‑incident reviews and drive systemic fixes.
Govern token/cost budgets, autoscaling thresholds, and vector store performance for FinOps efficiency.
Qualifications & Experience
Bachelor’s degree in Computer Science, Engineering, or Information Technology
Master’s degree in Systems Architecture, Cloud Computing, or AI‑related disciplines is preferred
9–14 years of overall IT or platform engineering experience
5–7 years designing or managing enterprise platforms (AI, data, or cloud platforms)
3–5 years in architecture or platform strategy roles supporting multiple teams or business units
Production readiness reviews, SLO/SLI/SLA design, incident management, RCA/postmortems, on-call support, and capacity planning for AI/ML platforms
Hands-on experience with AWS/GCP/Azure, GPU-aware infrastructure, Infrastructure as Code (Terraform), Docker, Kubernetes (EKS/GKE/AKS), and managing large-scale, multi-tenant clusters
Deploying ML/LLM workloads to production, model lifecycle management, RAG pipelines, safe rollouts (canary/shadow), rollback strategies, and managing inference scalability and latency
Metrics, logging, tracing, and alerting using Prometheus/Grafana/OpenTelemetry or cloud-native tools; monitoring AI-specific signals such as model drift, latency, token usage, and GPU utilization
Strong coding (Python/Go/Java), CI/CD pipelines (GitHub Actions, Jenkins), GitOps, automated reliability tooling, and security best practices (secrets management, access control, AI guardrails)
Certifications
Required:
NVIDIA Certified Professional: AI Infrastructure & Operations
NVIDIA DLI – Deploying AI with Kubernetes & GPUs
NVIDIA DLI – Building AI Infrastructure with NVIDIA Technologies
Certified Kubernetes Administrator
Docker Certified Associate
Red Hat Certified System Administrator (RHCSA)
Linux Foundation Certified System Administrator (LFCS)