Site Reliability Engineer, Staff Engineer
Listed on 2026-02-15
IT/Tech
SRE/Site Reliability, Cloud Computing, Systems Engineer, IT Support
Company Description
We are a Digital Product Engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale — across all devices and digital mediums, and our people exist everywhere in the world (18000+ experts across 37 countries, to be exact). Our work culture is dynamic and non-hierarchical. We are looking for great new colleagues.
That is where you come in!
The Site Reliability Engineer (SRE) will provide L2/L3 support for AWS cloud infrastructure and production environments, ensuring high availability, reliability, and operational efficiency. This role focuses on automating operational tasks, monitoring systems, and collaborating with DevOps, Development, and Infrastructure teams to resolve issues and improve service performance.
Responsibilities
- Provide L2/L3 support for AWS cloud infrastructure and production environments.
- Implement and maintain automation for operational tasks, deployments, and monitoring.
- Monitor system health, troubleshoot incidents, and ensure high availability of services.
- Develop and enhance scripts/tools to reduce manual effort and improve efficiency.
- Work closely with DevOps, Development, and Infrastructure teams for issue resolution.
- Participate in on-call rotations and incident management during US shift hours.
- Maintain and improve monitoring, alerting, and logging systems.
- Ensure adherence to SRE best practices for reliability, scalability, and performance.
- Document runbooks, SOPs, and knowledge base articles.
Requirements
- Strong hands-on experience with AWS services (EC2, S3, RDS, Lambda, VPC, IAM, CloudWatch).
- Experience in automation and scripting using Python, Shell, or PowerShell.
- Familiarity with Infrastructure as Code tools (Terraform or CloudFormation).
- Understanding of CI/CD pipelines and DevOps practices.
- Experience with monitoring tools like CloudWatch, Grafana, Prometheus, or ELK.
- Good understanding of Linux systems and networking concepts.
- Exposure to containerization (Docker/Kubernetes).
- Ability to troubleshoot production issues under pressure.
- Excellent verbal and written communication skills.
- Willingness to work a US time zone shift.