Sr. Site Reliability Engineer
Listed on 2026-01-01
IT/Tech
Cloud Computing, Systems Engineer
Job Overview
Site Reliability Engineers (SREs) at Coupang hold a mission‑critical role that combines software and systems engineering to build, run, and scale our complex, large‑scale e‑commerce systems. As part of the Site Reliability Engineering team, you will be responsible for ensuring all customer‑facing services are healthy, monitored, automated, and designed to scale. We take pride in treating operations as an engineering problem, with an automation‑first approach.
You will build best‑in‑class infrastructure automation for Observability, Incident Management, Disaster Recovery, Load Testing, Capacity Engineering, and more. You will work closely with product development teams from early design through to production incidents, maintain SLI/SLA bars, and influence design with SRE principles and best practices. If you take pride in ownership, enjoy solving complex technical challenges for large‑scale distributed systems, and communicate effectively across boundaries, this is the role for you.
- Serve as the primary point of contact responsible for the platform reliability, health, and performance of all Coupang customer‑facing services.
- Gain deep knowledge of Coupang application workflows and dependencies.
- Define and track KPIs and SLOs related to system availability, performance, and reliability.
- Build world‑class incident management processes and automation, including fast incident remediation, operational reviews, and retrospectives.
- Develop and implement best practices for creating, scaling, and maintaining effective monitoring, alerting, and telemetry systems.
- Build automation to run regular disaster recovery, chaos, and load tests to stay ahead of growth.
- Work closely with product teams to ensure designs incorporate scale and operability.
- Build guardrails and automation for deploying production changes while holding the reliability bar.
- Participate in a 24x7 rotation for production issue escalations, functioning well in a fast‑paced environment.
- Communicate effectively with stakeholders at all levels of the organization.
- Bachelor's degree in computer science, engineering, or a related technical field.
- 8+ years of industry experience building and operating large‑scale distributed systems.
- Prior experience with AI/ML, large‑scale web‑based Java architectures, and JVM configuration.
- Professional certifications in cloud platforms, monitoring tools, or related technologies.
- Previous experience working on a large‑scale GPU/Cloud Infrastructure platform.
- SLO/SLA management and implementation experience.
- Deep UNIX/Linux systems knowledge and administration background.
- Demonstrated programming skills in Python, Java, Golang, or Ruby.
- Strong problem‑solving and analytical skills across systems, network (TCP/IP), and code.
- Experience with cloud‑based GPU infrastructure (AWS, Azure, or GCP).
- Strong understanding of DevOps and SRE practices, including CI/CD and IaC.
- Experience with containerization and orchestration technologies such as Docker and Kubernetes.
- Excellent communication and collaboration skills, with the ability to work across distinct technical domains.
- Knowledge of the OpenTelemetry observability ecosystem, including metrics, logging, tracing, and tools such as Prometheus, Grafana, Elastic Stack, Datadog, or New Relic.
Our compensation reflects the cost of labor across several U.S. geographic markets. At Coupang, your base pay is one part of your total compensation. The base pay for this position ranges from $176,000 per year in our lowest geographic market to $221,000 per year in our highest geographic market. Pay is based on several factors including market location and may vary depending on job‑related knowledge, skills, and experience.
General Description of All Benefits
- Medical/Dental/Vision/Life, AD&D insurance
- Flexible Spending Accounts (FSA) and Health Savings Account (HSA)
- Long‑term/Short‑term Disability
- Employee Assistance Program (EAP)
- 401(k) plan with company match
- 18‑21 days of paid time off (PTO) a year based on tenure
- 12 public holidays
- Paid parental leave
- Pre‑tax commuter benefits
- Electric car charging station (MTV – Free)