Lead Data SRE; India Job, Irvine area, California, USA, IT/Tech

Position: Lead Data SRE (India)

Job Description

The Data SRE Lead is responsible for ensuring the reliability, scalability, performance, and operational excellence of the organization’s data platforms and pipelines. This role bridges Data Engineering and Site Reliability Engineering practices, applying SRE principles to modern data ecosystems (batch, streaming, warehousing, and ML data infrastructure). The role may be performed remotely, but a hybrid arrangement based in Chennai, India is strongly preferred in order to support the team locally.

Key Responsibilities

Reliability & Operations

Define and own SLIs, SLOs, and SLAs for data platforms and pipelines
Design and implement monitoring, alerting, and observability solutions
Lead incident response, root cause analysis (RCA), and postmortems
Reduce toil through automation and self-healing infrastructure

Data Platform Stability

Ensure high availability of:
- Data warehouses and lakehouses
- Streaming systems
- ETL/ELT pipelines
- Orchestration frameworks
Implement capacity planning and performance tuning strategies
Improve data pipeline reliability, freshness, and latency metrics

Infrastructure & Automation

Manage infrastructure-as-code (IaC) frameworks
Improve CI/CD pipelines for data workflows
Implement automated testing and validation for data infrastructure
Drive resilience patterns such as retries, circuit breakers, and graceful degradation

Leadership & Strategy

Lead and mentor a team of Data SREs
Define operational standards and reliability roadmaps
Collaborate cross-functionally with Data, Engineering, and Product leadership
Drive a culture of reliability and operational excellence

We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances.

If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy:

Skills and Requirements

8+ years in Site Reliability Engineering, Platform Engineering, or Data Engineering
3+ years in a technical leadership role
Strong experience with:
- Cloud platforms (AWS, GCP, or Azure)
- Infrastructure as Code (Terraform, CloudFormation)
- Monitoring tools (Prometheus, Datadog, Grafana)
- Containerization & orchestration (Docker, Kubernetes)
Deep understanding of distributed systems and failure modes
Experience supporting large-scale data systems (batch & streaming)
Experience with modern data platforms (Snowflake, BigQuery, Databricks)
Experience with streaming systems (Kafka, Pub/Sub, Kinesis)
Knowledge of data quality frameworks and data observability
Familiarity with ML platform reliability
