Site Reliability Engineer - System Service Global

Job in San Jose, Santa Clara County, California, 95111, USA
Listing for: ByteDance
Full Time position
Listed on 2026-06-01
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Job Description & How to Apply Below
The Global System Service team owns the infrastructure services and management solutions that power ByteDance's data centers outside of China - from day-to-day operations to long-term architecture design and maintenance. The team specializes in composing end-to-end solutions by drawing on both open-source community tools and in-house developed products, tailored to both the business requirements and the operational complexities of large-scale infrastructure across ByteDance's non-China regions.

Our mission is to deliver efficient infrastructure solutions and a stable, secure system environment for ByteDance's global business. We are looking for a self-motivated systems engineer with an SRE mindset and DevOps skills. Your responsibilities will include:

- Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.

- Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.

- Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.

- Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.

- Collaborate with network, security, and application teams to ensure foundational services meet the evolving demands of global business growth.

- Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.

Minimum Qualifications:

- Bachelor's degree or higher in Electrical Engineering, Computer Engineering, Computer Science or related majors.

- Solid experience in large-scale Linux host management, including OS deployment, configuration management, patching, and fleet operations.

- Strong hands-on knowledge of core data center foundational services: DNS (BIND/PowerDNS), NTP, DHCP, NAT, APT repository management, and Kerberos.

- Proficiency with DevOps tooling, including configuration management tools (e.g., Ansible, Salt, Puppet) and CI/CD pipelines.

- Familiarity with SRE principles and practices, including SLO/SLI definition, error budget management, and blameless post-mortems.

- Solid understanding of high availability design patterns, active-active/active-passive architectures, and disaster recovery strategies.

- Strong troubleshooting skills across the Linux system stack and network layer.

Preferred Qualifications:

- Experience managing host fleets at scale (thousands of nodes or above) in a production environment.

- Scripting or development experience in Python, Go, or Bash for automation and tooling.

- Exposure to hybrid or multi-region data center environments.