Lead Data SRE; India
Listed on 2026-04-23
IT/Tech
Cloud Computing, Data Engineer
Job Description
The Data SRE Lead is responsible for ensuring the reliability, scalability, performance, and operational excellence of the organization’s data platforms and pipelines. This role bridges Data Engineering and Site Reliability Engineering practices, applying SRE principles to modern data ecosystems (batch, streaming, warehousing, and ML data infrastructure). This role has the potential to be remote, but a hybrid arrangement in Chennai, India is strongly preferred to support the team locally.
Key Responsibilities
Reliability & Operations
- Define and own SLIs, SLOs, and SLAs for data platforms and pipelines
- Design and implement monitoring, alerting, and observability solutions
- Lead incident response, root cause analysis (RCA), and postmortems
- Reduce toil through automation and self-healing infrastructure
- Ensure high availability of:
  - Data warehouses and lakehouses
  - Streaming systems
  - ETL/ELT pipelines
  - Orchestration frameworks
- Implement capacity planning and performance tuning strategies
- Improve data pipeline reliability, freshness, and latency metrics
- Manage infrastructure-as-code (IaC) frameworks
- Improve CI/CD pipelines for data workflows
- Implement automated testing and validation for data infrastructure
- Drive resilience patterns such as retries, circuit breakers, and graceful degradation
- Lead and mentor a team of Data SREs
- Define operational standards and reliability roadmaps
- Collaborate cross-functionally with Data, Engineering, and Product leadership
- Drive a culture of reliability and operational excellence
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.
If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:
- 8+ years in Site Reliability Engineering, Platform Engineering, or Data Engineering
- 3+ years in a technical leadership role
- Strong experience with:
  - Cloud platforms (AWS, GCP, or Azure)
  - Infrastructure as Code (Terraform, CloudFormation)
  - Monitoring tools (Prometheus, Datadog, Grafana)
  - Containerization & orchestration (Docker, Kubernetes)
- Deep understanding of distributed systems and failure modes
- Experience supporting large-scale data systems (batch & streaming)
- Experience with modern data platforms (Snowflake, BigQuery, Databricks)
- Experience with streaming systems (Kafka, Pub/Sub, Kinesis)
- Knowledge of data quality frameworks and data observability
- Familiarity with ML platform reliability