Principal SRE Job: Delhi area, Delhi, India (IT/Tech)

We are looking for a Principal Site Reliability Engineer to join our dynamic Services team. In this role, you will contribute to the reliability and scalability of our cutting-edge platform, ensuring exceptional solutions tailored to our customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.

Key Responsibilities:

Reliability Engineering:
Design and build automated systems that ensure the reliability and scalability of our Kubernetes clusters and Hydrolix deployments across multiple cloud platforms, eliminating manual operational tasks.
Automation and Efficiency:
Identify, quantify, and systematically eliminate repetitive manual work (toil) through automation and improved tooling, freeing the team to focus on high-value work.
Observability Infrastructure:
Build and enhance comprehensive observability systems that provide deep visibility into system behavior, enable debugging and troubleshooting, and support data-driven reliability decisions.
CI/CD and Deployment Automation:
Design and build robust CI/CD pipelines and deployment automation that enable safe, frequent releases with minimal human intervention.
Infrastructure Reliability:
Deploy and maintain a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
Service Optimization:
Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
Root Cause Analysis:
Conduct comprehensive root cause analyses of system failures and implement long-term preventive measures.
Collaboration and Customer Engagement
Cross-Functional Teamwork:
Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
Knowledge Sharing:
Document systems, create runbooks, and share knowledge across the organization to build collective expertise in reliability engineering.
Reliability Advocacy:
Champion SRE best practices and foster a culture of operational excellence across the organization.
Reliability Systems:
Build and maintain centralized reliability platforms, tools, and services that empower all engineering teams to operate their systems effectively.
Global Team Collaboration:
Collaborate with a distributed team of engineers worldwide to provide round-the-clock support and continuous improvement of our reliability posture.
Customer-Facing Reliability:
Work with customers to understand reliability requirements and ensure our platform meets their operational needs.

Qualifications and Skills:

SRE Expertise:
A minimum of 10 years of proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, supporting large-scale, complex distributed systems in production.
Demonstrated ability to operate at a principal level by setting reliability direction, defining standards, and influencing system design across multiple teams.

Architecture, Performance & Scalability
Deep experience designing and evolving system architectures with reliability, scalability, and operability as first-class concerns.
In-depth experience in application and infrastructure performance tuning and scaling to handle heavy workloads under varying traffic patterns and failure scenarios.
Ability to identify systemic bottlenecks, capacity risks, and inefficiencies, and drive long-term architectural improvements.

Automation, Platform & Infrastructure Engineering
Exceptional track record of eliminating toil through automation, including building internal platforms or frameworks that enable safe, scalable self-service.
In-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Pulumi, and Ansible for provisioning and managing infrastructure consistently across environments.

Observability & Reliability Engineering
Deep expertise in observability tools and practices, with the ability to design end-to-end monitoring strategies aligned with business outcomes.
Strong understanding of core reliability concepts, including SLIs, SLOs, SLAs, error budgets,…

