Site Reliability Engineer Job, Gurgaon area, Uttar Pradesh, India, IT/Tech

Key Responsibilities
Platform Design & Architecture
Define and evolve the architecture of the observability platform, integrating logs, metrics, traces, events, and alerts
Establish reference implementations and patterns for integrating observability into cloud-native and monolithic applications
Evaluate and integrate best-in-class tools for telemetry (e.g., OpenTelemetry, Prometheus, New Relic, Grafana, Elastic, Splunk)
Governance & Standards
Define enterprise-wide observability standards and maturity models (instrumentation guidelines, SLOs/SLIs, retention policies)
Drive instrumentation consistency across services through libraries, SDKs, and developer onboarding assets
Embed observability standards into CI/CD pipelines, golden paths, and developer enablement frameworks
Platform Engineering & Operations
Build and maintain core observability infrastructure as internal platform services
Ensure the observability platform is highly available, scalable, cost-optimized, and compliant with governance controls
Automate provisioning, onboarding, alerting configuration, and tenant lifecycle management for internal teams
Developer Enablement & Integration
Create self-service capabilities for developers and SREs:
Instrumentation kits
Dashboards and alert templates
Troubleshooting guides and observability sandboxes
Collaborate with Developer Experience and Platform teams to embed observability into the developer workflow and developer portal (Velocity)
Adoption & Support
Lead and support migration and onboarding efforts for application teams
Partner with GPS, ISS, and platform teams to define key use cases and integration paths
Define telemetry baselines and observability KPIs for portfolio-level measurement

Required:

6+ years of experience in Site Reliability Engineering, Platform Engineering, or DevOps roles
Deep understanding of observability concepts (logs, metrics, traces, events, SLOs, SLIs, RED/USE models)
Hands-on experience with one or more tools in the observability stack (Grafana, Elastic, Prometheus, Splunk, Datadog, OpenTelemetry)
Strong scripting or automation skills (Python, Go, Bash, Terraform, etc.)
Familiarity with Kubernetes, container orchestration, and cloud-native environments (AWS/Azure)

Preferred:

Experience designing or operating an enterprise-wide observability platform
Exposure to multi-tenant observability systems, billing or usage metering
Knowledge of developer experience workflows and developer portals
Previous work with standards enforcement and governance-as-code

