Site Reliability Engineer
Job in Berkeley, Alameda County, California, 94720, USA
Listed on 2026-06-03
Listing for: Lawrence Berkeley National Laboratory
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, IT Support, Cybersecurity
Job Description & How to Apply Below
We're here for the same mission, to bring science solutions to the world. Join our team and YOU will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!
Why join Berkeley Lab?
We invest in our employees by offering a total rewards package you can count on:
- Exceptional health and retirement benefits, including pension or 401(k)-style plans
- Opportunities to grow in your career - check out our Tuition Assistance Program
- A culture where you'll belong - we are invested in our teams!
- In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.
- Parental bonding leave (for both mothers and fathers)
- Pet insurance
What you will do:
- Work a 5-day schedule with 2-3 onsite operations shifts and 2-3 project days, rotating across day, swing, and overnight shifts as needed to monitor the NERSC HPC facility.
- Monitor and respond to system, storage, network, and facility alerts, escalating issues when necessary.
- Improve reliability through automation, process optimization, monitoring enhancements, and root-cause prevention.
- Develop and maintain monitoring, alerting, and diagnostic tools, including integrations with HPC system APIs and ServiceNow.
- Support 24/7 data collection and real-time diagnostics across critical infrastructure.
- Contribute to Agentic AI solutions that automate workflows and improve operational efficiency.
- Coordinate with NERSC teams on maintenance, workflows, and incident management.
- Perform physical and logical data center inspections to ensure environmental and infrastructure health.
- Maintain accurate incident and maintenance records in the ticketing system.
- Analyze and resolve complex operational issues using sound technical judgment and collaboration with internal and external experts.
What is required:
- Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or equivalent work experience.
- Experience in or willingness to work within a 24/7 onsite team environment to support large-scale data centers or critical installations.
- Experience with the Linux shell and working in a command-line (e.g., SSH) environment.
- Experience developing tools in a programming language such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.
- Motivated self-starter who can learn technologies that improve data center management in areas like Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.
- Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
- Experience collaborating across technical teams to resolve operational bottlenecks and ensure system reliability and alignment with service-level objectives.
- Knowledge of, and ability to work on, large data communications networks, network protocols, and IT infrastructure supporting highly available systems and applications.
Desired qualifications:
- Experience with ServiceNow implementation is a plus, particularly in architecting or deploying solutions for Incident Management, Change Management, or CMDB to improve IT workflows.
- Practical experience in developing and deploying Agentic AI or autonomous automation tools to streamline technical tasks.
- Familiarity with ITSM best practices and an understanding of how to align service life cycles with business goals is preferred.
- A certification in a system administration area (platforms or software), or other advanced education in computer science.
- ServiceNow certifications.
- ITIL certifications.
- Applications will be accepted until the job posting…