Site Reliability Engineer - System Service Global Job San Jose area, California, USA, IT/Tech

The Global System Service team owns the infrastructure services and management solutions that power ByteDance's data centers outside of China, from day-to-day operations to long-term architecture design and maintenance. The team builds end-to-end solutions from open-source community tools and in-house products, tailored to the business requirements and operational complexity of large-scale infrastructure across ByteDance's non-China regions.

Our mission is to deliver efficient infrastructure solutions and a stable, secure system environment for ByteDance's global business. We are looking for a self-motivated systems engineer with an SRE mindset and DevOps skills. Your responsibilities will include:

- Manage and maintain large-scale host infrastructure across ByteDance's non-China data centers, covering OS lifecycle management, configuration standardization, and fleet-wide health monitoring.

- Own the reliability and availability of core data center foundational services, including DNS, NTP, DHCP, NAT, APT repository, and Kerberos authentication.

- Design and implement deployment architectures for foundational services, ensuring high availability, fault tolerance, and disaster recovery across regions.

- Develop and enforce SLOs for managed services; lead incident response, root cause analysis, and post-mortem reviews to drive continuous reliability improvements.

- Collaborate with network, security, and application teams to ensure foundational services meet the evolving demands of global business growth.

- Identify automation opportunities across host management and service operations; drive tooling and process improvements to reduce toil and increase operational efficiency.

Minimum Qualifications:

- Bachelor's degree or higher in Electrical Engineering, Computer Engineering, Computer Science, or a related field.

- Solid experience in large-scale Linux host management, including OS deployment, configuration management, patching, and fleet operations.

- Strong hands-on knowledge of core data center foundational services: DNS (BIND/PowerDNS), NTP, DHCP, NAT, APT repository management, and Kerberos.

- Proficiency with DevOps tooling, including configuration management tools (e.g., Ansible, Salt, Puppet) and CI/CD pipelines.

- Familiarity with SRE principles and practices, including SLO/SLI definition, error budget management, and blameless post-mortems.

- Solid understanding of high availability design patterns, active-active/active-passive architectures, and disaster recovery strategies.

- Strong troubleshooting skills across the Linux system stack and network layer.

Preferred Qualifications:

- Experience managing host fleets at scale (thousands of nodes or more) in a production environment.

- Scripting or development experience in Python, Go, or Bash for automation and tooling.

- Exposure to hybrid or multi-region data center environments.