Observability Lead - Cloud SRE & Network Reliability Job, Fremont area, California, USA, IT/Tech

The group you’ll be a part of

The Global Information Systems Group is dedicated to the success of Lam by providing best-in-class, innovative information systems solutions and services. Together, we support users globally with data, information, and systems to achieve their business objectives.

The impact you’ll make

Our team at Lam is seeking a hands‑on Observability Lead with a strong Site Reliability Engineering (SRE) and multi‑cloud networking foundation to join our GIS Infrastructure Platform Engineering team. You will lead engineers in delivering robust observability frameworks, SLA/SLO/SLI disciplines, DR/BCP programs, backup and restore operations, and end‑to‑end network reliability across Azure, AWS, and GCP. You will own the full‑stack delivery of observability, reliability, and resilience capabilities across a global multi‑cloud enterprise.

What you’ll do

Lead and grow a team delivering a world‑class observability platform across global, multi‑cloud production environments, including Azure, AWS, and GCP.
Define and enforce SLA, SLO, and SLI frameworks across all infrastructure and network domains, driving continuous improvement through effective error budget management.
Own end‑to‑end multi‑cloud network observability, including VNet and VPC traffic flows, Transit Gateway routing, BGP peering health, and inter‑region connectivity.
Design and govern multi‑cloud networking architectures, including Azure VNet, AWS VPC and Transit Gateway, GCP VPC, and hybrid connectivity solutions such as ExpressRoute, Direct Connect, and Cloud Interconnect.
Design and implement agentic AI workflows using LLM‑based agents, RAG patterns, and orchestration frameworks to enable AIOps‑driven fault detection and remediation.
Own disaster recovery (DR) and business continuity planning (BCP) strategy, including runbook authorship, multi‑cloud failover validation, and periodic DR drills to ensure RTO and RPO commitments are met.
Lead backup and restore operations across multi‑cloud and hybrid environments, incorporating automated validation and cross‑cloud recovery workflows.
Build robust monitoring and alerting pipelines by integrating Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Azure Monitor, CloudWatch, and Google Cloud Operations into a unified observability stack.
Drive automation‑first practices through self‑healing pipelines, remediation playbooks, and infrastructure‑as‑code (IaC) patterns to reduce toil and improve MTTR.
Lead P1, P2, and P3 incident response efforts, including structured post‑mortems and action tracking.
Define and drive the multi‑quarter roadmap for observability, reliability, networking, DR/BCP, and AI‑assisted operations.
Support hiring, performance management, and career development for the team.

Who we’re looking for

A BS, MS, or PhD in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years of overall experience in Infrastructure, SRE, DevOps, or Network Engineering and 6+ years of experience leading high‑performing SRE, Observability, or Platform Engineering teams.
Proven expertise in defining, enforcing, and operating SLA, SLO, and SLI frameworks, including effective error budget management.
Hands‑on experience with disaster recovery (DR) and business continuity planning (BCP), including RTO/RPO planning, failover testing, and continuity documentation.
Deep expertise in backup and restore operations across multi‑cloud and hybrid environments.
Strong multi‑cloud networking skills across Azure (VNet, ExpressRoute, Virtual WAN), AWS (VPC, Transit Gateway, Direct Connect), and GCP (VPC, Cloud Interconnect, VPC‑SC).
Experience building and operating observability platforms, including tools such as Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Splunk, or equivalent solutions, with a focus on network telemetry and flow analysis.
Deep expertise in automation, including Ansible, Terraform, Python, and self‑healing infrastructure pipelines.
Hands‑on experience with infrastructure as code (IaC), CI/CD pipelines, Kubernetes (AKS, EKS, GKE), and all three major cloud platforms.
Strong programming skills in Python or Go for tooling,…