Cloud Site Reliability Engineer - DCS Cloud
Listed on 2026-05-31
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability
Responsibilities
Our Infrastructure Engineering team supports the company's fast growth by building and operating hyper-scale datacenters, managing the lifecycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.
- Cloud Host Delivery, Delivery & Standardization
- Cloud Host Operation, Operation Efficiency & Reliability
- Cloud Management & Security
- Design, build, scale, and operate ByteDance’s global infrastructure, including large-scale systems spanning public and private clouds.
- Develop tools, automation frameworks, visualizations, and monitoring systems to streamline operations and drive optimization of global infrastructure.
- Create, manage, and standardize cloud AMIs/images for use across multiple environments, ensuring strict alignment with the company's global compliance standards.
- Thrive in a fast-paced environment, engaging in technical operations and on-call rotations to address incidents related to cloud, OS, network, performance, and reliability.
- Drive improvements across the entire infrastructure lifecycle, from ideation and design through development, deployment, user support, and continuous refinement.
Minimum Qualifications
- Bachelor’s degree or above in Computer Science, Software Engineering, Information Security, or a related field.
- 2+ years of experience in Linux operations, SRE, or DevOps.
- Proficient in at least one programming language such as Go, Python, or C++, with solid engineering capabilities in platform development, system tooling, and automation.
- Strong computer science fundamentals, with deep understanding of Linux OS principles, computer networks, storage systems, GPU systems, and databases, along with systematic troubleshooting and root‑cause analysis skills.
- Familiar with core reliability practices, including monitoring and alerting, capacity management, change management, canary/gray releases, incident response, and postmortem processes.
- Strong communication and collaboration skills, with the ability to proactively identify problems, drive cross‑team execution, and demonstrate strong ownership and a results‑oriented mindset.
Preferred Qualifications
- Hands‑on experience operating public cloud platforms, or deep familiarity with major cloud providers such as OCI, AWS, Azure, GCP, etc., including understanding of their underlying mechanisms.
- Experience with large‑scale cloud host delivery, image/AMI systems, resource scheduling, network adaptation, and virtualization technologies such as KVM/QEMU.
- Familiar with containers and cloud‑native ecosystems, including Docker, Kubernetes, and containerd, with a solid understanding of isolation mechanisms like cgroups and namespaces.
- Experience maintaining GPU clusters, including drivers, CUDA, MIG, topology awareness, troubleshooting, stress testing, and GPU delivery pipelines.
- Proven experience in reliability‑focused initiatives such as failure drill systems, capacity governance, change governance, observability platforms, and resource cost optimization.
- Open‑source contributions, technical blogs, patents, or technical sharing experience are highly preferred.
- Experience operating large‑scale production environments is a strong plus.
For Pay Transparency:
Compensation Description (Annually). The base salary range for this position in the selected city is $148,200 – $300,960 annually. Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.
Benefits may vary depending on the nature of employment and the country work location. Employees have day-one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short‑term and long‑term disability coverage, life insurance, and wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year, and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).
The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
ByteDance is committed to providing reasonable accommodations in our recruitment processes for candidates with disabilities, pregnancy, sincerely held religious beliefs or other reasons protected by applicable laws. If you need assistance or a reasonable accommodation, please reach out to us at