Site Reliability Engineer - System Service Global
Job in San Jose, Santa Clara County, California, 95111, USA
Listed on 2026-06-01
Listing for: ByteDance
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing
Job Description & How to Apply Below
Our mission is to deliver efficient infrastructure solutions and a stable, secure system environment for ByteDance's global business. We are looking for a self-motivated systems engineer with an SRE mindset and DevOps skills. Your responsibilities will include:
- Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.
- Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.
- Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.
- Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.
- Collaborate with network, security, and application teams to ensure foundational services meet the evolving demands of global business growth.
- Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.
Minimum Qualifications:
- Bachelor's degree or higher in Electrical Engineering, Computer Engineering, Computer Science, or a related field.
- Solid experience in large-scale Linux host management, including OS deployment, configuration management, patching, and fleet operations.
- Strong hands-on knowledge of core data center foundational services: DNS (BIND/PowerDNS), NTP, DHCP, NAT, APT repository management, and Kerberos.
- Proficiency with DevOps tooling, including configuration management tools (e.g., Ansible, Salt, Puppet) and CI/CD pipelines.
- Familiarity with SRE principles and practices, including SLO/SLI definition, error budget management, and blameless post-mortems.
- Solid understanding of high availability design patterns, active-active/active-passive architectures, and disaster recovery strategies.
- Strong troubleshooting skills across the Linux system stack and network layer.
Preferred Qualifications:
- Experience managing host fleets at scale (thousands of nodes or more) in a production environment.
- Scripting or development experience in Python, Go, or Bash for automation and tooling.
- Exposure to hybrid or multi-region data center environments.