Site Reliability Engineer Job, Palo Alto area, California, USA, IT/Tech

Quantum computing holds the promise of humanity's mastery over the natural world, but only if we can build a real quantum computer. PsiQuantum is on a mission to build the first real, useful quantum computers, capable of delivering the world‑changing applications that the technology has long promised. We know that means we will need to build a system with roughly 1 million qubits that supports fault‑tolerant error correction within a scalable architecture and a data center footprint.

By harnessing the laws of quantum physics, quantum computers can provide exponential performance increases over today's most powerful supercomputers, offering the potential for extraordinary advances across a broad range of industries including climate, energy, healthcare, pharmaceuticals, finance, agriculture, transportation, materials design, and many more.

PsiQuantum has determined the fastest path to delivering a useful quantum computer, years earlier than the rest of the industry. Our architecture is based on silicon photonics, which gives us the ability to produce our components at Tier‑1 semiconductor fabs such as GlobalFoundries, where we leverage high‑volume semiconductor manufacturing processes, the same processes that are already producing billions of chips for telecom and consumer electronics applications.

We also benefit from the quantum‑mechanical reality that photons don’t feel heat or electromagnetic interference, allowing us to take advantage of existing cryogenic cooling systems and industry‑standard fiber connectivity.

In 2024, PsiQuantum announced two government‑funded projects to support the build‑out of our first Quantum Data Centers and utility‑scale quantum computers in Brisbane, Australia and Chicago, Illinois. Both projects are backed by nations that understand quantum computing’s potential impact and the need to scale this technology to unlock that potential. And we won’t just be building the hardware, but also the fault‑tolerant quantum applications that will provide industry‑transforming results.

Quantum computing is not just an evolution of the decades‑old advancement in compute power. It provides the key to mastering our future, not merely discovering it. The potential is enormous, and we have the plan to make it real. Come join us.

There’s much more work to be done and we are looking for exceptional talent to join us on this extraordinary journey!

Job Summary

Join the OS/Platform team as a Site Reliability Engineer (SRE) and keep our services healthy, observable, and fast. Partnering with the Platform Engineering group, you’ll own the day‑to‑day operation of our monitoring stack—Grafana, Prometheus, Loki, and Tempo—crafting dashboards that surface golden signals and drive real‑time insight. You’ll codify reliability through SLIs/SLOs, automate runbooks in Python, and lead incident response to maintain world‑class uptime across both on‑prem and AWS environments.

Responsibilities

• Define, implement, and iterate on Service Level Indicators & Service Level Objectives (SLIs/SLOs) and error budgets for critical services, with a focus on network reliability and data center interconnects.

• Build and maintain Grafana dashboards that visualize golden signals (latency, traffic, errors, saturation), extending coverage to network telemetry such as packet loss, jitter, bandwidth utilization, and BGP/EVPN stability.

• Operate and tune the observability pipeline (Prometheus, Loki, Tempo) to ensure scalable, low‑latency telemetry ingestion and alerting for networking as well as compute layers.

• Drive incident response: triage, mitigate, perform post‑incident reviews, and implement preventive actions—particularly for network‑related outages, congestion, or misconfigurations.

• Develop automation and self‑service tooling in Python/Bash to streamline alerts, runbooks, and operational tasks, including network monitoring and diagnostics.

• Collaborate with Platform, Product, and Networking teams on capacity planning, performance testing, traffic engineering, and change management.

• Improve CI/CD health checks and release safety nets within GitLab, with attention to network dependencies in deployments.

•…
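
For candidates unfamiliar with the SLO and error-budget terms used above, the arithmetic can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 99.9% target, and the 30-day window are assumptions chosen for the example and are not taken from the posting or from the team's actual tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available" (hypothetical value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when no failures have occurred and a negative number
    when more requests have failed than the SLO allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability target over 30 days.
    print(f"Downtime budget: {error_budget_minutes(0.999):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

Under these assumed numbers, a service that has failed 250 of 1,000,000 requests has spent about a quarter of its budget, which is the kind of signal an error-budget policy would act on.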

