Site Reliability Engineer
Location: Hyderabad
Notice Period: Immediate to 20 Days
Employment Type: Full Time
Experience
- 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)
Primary Skills (Must-Have)
- AWS, CI/CD, Jenkins, IaC, Terraform, Kubernetes
Secondary Skills (Good-to-Have)
- AWS systems; Dataiku data, platform updates and patching
Tools & Platforms
- Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt
- CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform
- Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions
- Monitoring & Logging: Datadog, AWS CloudWatch, Prometheus, Splunk
- Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard
- Collaboration & Code Review: GitHub, Jira, Confluence
Key Responsibilities
Data Pipeline Reliability & Observability:
- Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
- Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
- Automate data pipeline health checks, error handling, and auto-remediation strategies
Infrastructure & Cloud Automation:
- Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
- Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
- Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions
Performance, Monitoring & Incident Response:
- Implement real-time monitoring, logging, and alerting using Datadog, AWS CloudWatch, and Prometheus
- Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
- Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents
Security & Compliance:
- Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
- Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
- Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access
Collaboration & Leadership:
- Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
- Participate in incident response drills, disaster recovery planning, and security compliance reviews
- Advocate for best practices in automation, cost optimization, and cloud-native data solutions