Site Reliability Engineer

Job in Berkeley, Alameda County, California, 94709, USA
Listing for: Lawrence Berkeley National Laboratory
Full Time position
Listed on 2026-05-12
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, Cybersecurity
Salary/Wage Range or Industry Benchmark: 117,132 - 197,676 USD yearly
About the role

The National Energy Research Scientific Computing Center (NERSC) is hiring a Site Reliability Engineer to help ensure its HPC and data systems remain reliable, secure, and accessible for 11,000 scientific users.

As part of a 24x7 operations team, you will use advanced monitoring and data systems to proactively maintain the health of NERSC's computing environment and support critical DOE scientific research.

Benefits
  • Exceptional health and retirement benefits, including pension or 401(k)-style plans
  • Opportunities to grow in your career - check out our Tuition Assistance Program
  • A culture where you'll belong - we are invested in our teams!
  • In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year
  • Parental bonding leave (for both mothers and fathers)
  • Pet insurance
Responsibilities
  • Work a 5-day schedule with 2-3 onsite operations shifts and 2-3 project days, rotating across day, swing, and overnight shifts as needed to monitor the NERSC HPC facility.
  • Monitor and respond to system, storage, network, and facility alerts, escalating issues when necessary.
  • Improve reliability through automation, process optimization, monitoring enhancements, and root‑cause prevention.
  • Develop and maintain monitoring, alerting, and diagnostic tools, including integrations with HPC system APIs and ServiceNow.
  • Support 24/7 data collection and real‑time diagnostics across critical infrastructure.
  • Contribute to agentic AI solutions that automate workflows and improve operational efficiency.
  • Coordinate with NERSC teams on maintenance, workflows, and incident management.
  • Perform physical and logical data center inspections to ensure environmental and infrastructure health.
  • Maintain accurate incident and maintenance records in the ticketing system.
  • Analyze and resolve complex operational issues using sound technical judgment and collaboration with internal and external experts.
Qualifications
  • Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or equivalent work experience.
  • Experience in or willingness to work within a 24/7 onsite team environment to support large‑scale data centers or critical installations.
  • Experience with the Linux shell and working in a command‑line environment (e.g. over SSH).
  • Experience developing tools in programming languages such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.
  • Motivated self‑starter who can learn technologies that improve data center management, in areas such as Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.
  • Experience with network security: configuring/maintaining ACLs, knowledge of firewalls.
  • Experience collaborating across technical teams to resolve operational bottlenecks and ensure system reliability and alignment with service‑level objectives.
  • Knowledge of and ability to work on large data communications networks, network protocols, and IT infrastructure supporting highly available systems and applications.
Desired skills/knowledge
  • Experience with ServiceNow implementation is a plus, particularly in architecting or deploying solutions for Incident Management, Change Management, or CMDB to improve IT workflows.
  • Practical experience in developing and deploying agentic AI or autonomous automation tools to streamline technical tasks.
  • Familiarity with ITSM best practices and an understanding of how to align service life cycles with business goals is preferred.
  • A certification in a system administration area (platforms or software), or other advanced education in computer science.
  • ServiceNow certifications.
  • ITIL certifications.
Additional information
  • Applications will be accepted until the job posting is removed.
  • Appointment type:
    This is a full‑time career appointment, exempt from overtime pay (monthly paid).
  • Salary range:
    The expected salary for this position is $131,760 - $161,064, which falls within the full salary range of $117,132 - $197,676, depending upon the candidate's skills, knowledge, and abilities. This includes education,…