SRE, Observability - Decentralized - Computing Leader
Listed on 2025-12-30
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Location: New York
SRE, Observability - Decentralized High-Performance Computing Leader
Senior / Staff Site Reliability Engineer – Observability & Telemetry Systems
About The Role
We’re seeking an accomplished Site Reliability Engineer with deep expertise in large-scale observability systems to help shape and operate the monitoring backbone of a global AI cloud platform. You’ll design, build, and maintain the telemetry infrastructure that ensures the performance, reliability, and visibility of systems powering advanced machine learning and high-performance computing workloads around the world.
In this role, you’ll be the technical authority driving how metrics, logs, and traces are captured, processed, and visualized across a massive distributed environment. From optimizing cost efficiency at scale to ensuring rapid root‑cause analysis during incidents, you’ll be building the observability systems that keep mission‑critical AI workloads running smoothly and predictably.
What You’ll Do
- Architect large‑scale observability systems: design and operate telemetry pipelines for metrics, logs, and traces using modern observability stacks (Prometheus, Mimir, Loki, Tempo, Grafana) at petabyte scale.
- Ensure reliability and efficiency: tune distributed telemetry systems for performance, cardinality control, and cost optimization while maintaining high availability across global deployments.
- Empower debugging and insight: build tools and frameworks that give developers deep visibility into distributed ML training, inference pipelines, and infrastructure performance.
- Collaborate cross‑functionally: partner with platform, SRE, and infrastructure teams to extend observability coverage for Kubernetes clusters, Slurm schedulers, and GPU‑based compute environments.
- Operational excellence: establish SLOs, alerting policies, and observability standards that reduce noise, streamline incident response, and strengthen reliability culture across teams.
- Automate at scale: develop clean, maintainable code in Go, Python, or Bash to extend observability tooling and automate operational workflows.
- 7+ years of total engineering experience, including at least 3 years building or operating large-scale observability or telemetry infrastructure (100M+ metric series, 10TB+/day logs).
- Proven expertise with the Grafana ecosystem (Prometheus, Mimir, Loki, Tempo, Grafana, and Alertmanager) in production environments.
- Hands‑on proficiency with Kubernetes, including Helm, Kustomize, custom CRDs, and multi‑cluster federation.
- Experienced with Terraform (or Pulumi) and Infrastructure‑as‑Code best practices for hybrid or bare‑metal provisioning.
- Strong programming ability in Go (preferred), with additional experience in Python or Bash for automation, data collection, and controller development.
- Deep knowledge of Linux internals (cgroups, namespaces, networking, and filesystem performance), plus foundational TCP/IP and TLS expertise.
- Experienced in defining and enforcing SLOs, SLIs, and alerting mechanisms that align engineering focus with real user impact.
- Calm and methodical under pressure — you’ve led incident response efforts, authored postmortems, and driven systemic improvements afterward.
- Communicative and collaborative — able to explain complex systems clearly and influence peers in dynamic, cross‑functional environments.
- Instrumentation of GPU‑heavy or HPC clusters (NVIDIA A‑/H‑series, NVSwitch, DGX, RoCE, RDMA).
- Observability for distributed ML workloads managed by Slurm, Ray, or Kubernetes‑native batch schedulers.
- Hands‑on with eBPF, Cilium, or Hubble for high‑fidelity, low‑overhead network visibility.
- Experience deploying and migrating to OpenTelemetry across metrics, logs, and traces.
- Operating service meshes like Istio or Linkerd and managing telemetry pipelines built on Envoy.
- Managing observability across distributed or multi‑region environments (US/EU/APAC), optimizing for latency and cost.
- Implementing cost and resource monitoring using tools like Kubecost or Cloudability.
- Security observability overlap: integrating Falco, GuardDuty, or auditd into telemetry pipelines.
- Contributions to open‑source observability projects or…