Observability Lead - Cloud SRE & Network Reliability Job, Fremont area, California, USA, IT/Tech

## The group you'll be a part of

The Global Information Systems Group is dedicated to the success of Lam by providing best-in-class, innovative information system solutions and services. Together, we support users globally with data, information, and systems to achieve their business objectives.

## The impact you'll make

Our team at Lam is seeking a hands-on Observability Lead with a strong Site Reliability Engineering (SRE) and multi-cloud networking foundation to join our GIS Infrastructure Platform Engineering team. You will lead engineers in delivering robust observability frameworks, SLA/SLO/SLI disciplines, DR/BCP programs, backup and restore operations, and end-to-end network reliability across Azure, AWS, and GCP. You will own the full-stack delivery of observability, reliability, and resilience capabilities across a global multi-cloud enterprise.

## What you'll do

* Lead and grow a team delivering a world-class observability platform across global, multi-cloud production environments, including Azure, AWS, and GCP.

* Define and enforce SLA, SLO, and SLI frameworks across all infrastructure and network domains, driving continuous improvement through effective error budget management.

* Own end-to-end multi-cloud network observability, including VNet and VPC traffic flows, Transit Gateway routing, BGP peering health, and inter-region connectivity.

* Design and govern multi-cloud networking architectures, including Azure VNet, AWS VPC and Transit Gateway, GCP VPC, and hybrid connectivity solutions such as ExpressRoute, Direct Connect, and Cloud Interconnect.

* Design and implement agentic AI workflows using LLM-based agents, RAG patterns, and orchestration frameworks to enable AIOps-driven fault detection and remediation.

* Own disaster recovery (DR) and business continuity planning (BCP) strategy, including runbook authorship, multi-cloud failover validation, and periodic DR drills to ensure RTO and RPO commitments are met.

* Lead backup and restore operations across multi-cloud and hybrid environments, incorporating automated validation and cross-cloud recovery workflows.

* Build robust monitoring and alerting pipelines by integrating Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Azure Monitor, CloudWatch, and Google Cloud Operations into a unified observability stack.

* Drive automation-first practices through self-healing pipelines, remediation playbooks, and infrastructure-as-code (IaC) patterns to reduce toil and improve MTTR.

* Lead P1, P2, and P3 incident response efforts, including structured post-mortems and action tracking.

* Define and drive the multi-quarter roadmap for observability, reliability, networking, DR/BCP, and AI-assisted operations.

* Support hiring, performance management, and career development for the team.

## Who we're looking for

* A BS, MS, or PhD in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years of overall experience in Infrastructure, SRE, DevOps, or Network Engineering, and 6+ years of experience leading high-performing SRE, Observability, or Platform Engineering teams.

* Proven expertise in defining, enforcing, and operating SLA, SLO, and SLI frameworks, including effective error budget management.

* Hands-on experience with disaster recovery (DR) and business continuity planning (BCP), including RTO/RPO planning, failover testing, and continuity documentation.

* Deep expertise in backup and restore operations across multi-cloud and hybrid environments.

* Strong multi-cloud networking skills across Azure (VNet, ExpressRoute, Virtual WAN), AWS (VPC, Transit Gateway, Direct Connect), and GCP (VPC, Cloud Interconnect, VPC-SC).

* Experience building and operating observability platforms, including tools such as Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Splunk, or equivalent solutions, with a focus on network telemetry and flow analysis.

* Deep expertise in automation, including Ansible, Terraform, Python, and self-healing infrastructure pipelines.

* Hands-on experience with infrastructure as code (IaC), CI/CD pipelines, Kubernetes (AKS, EKS, GKE), and all three major cloud platforms.

* Strong programming skills in Python or Go for tooling, automation, and system integrations.

* Experience leading P1, P2, and P3 incident management, including ITSM integration (ServiceNow preferred).

* Exceptional communication skills, with the ability to translate complex technical concepts into clear business value for engineering, product, and executive stakeholders.

## Preferred qualifications

* Experience with AIOps, including AI-assisted network fault detection, anomaly correlation, and auto-remediation.

* Familiarity with agentic AI workflows, including LLM-based agents and RAG patterns, applied to observability and operational use cases.

* Background in global WAN architectures, including MPLS and resilience strategies for multi-region enterprise environments.

* Experience with compliance-driven disaster recovery and business…