Observability Lead - Cloud SRE & Network Reliability
Job in
Fremont, Alameda County, California, 94537, USA
Listed on 2026-06-10
Listing for:
Lam Research
Full Time position
Job specializations:
IT/Tech
Cloud Computing, Systems Engineer, Cybersecurity, IT Support
Job Description & How to Apply Below
The Global Information Systems Group is dedicated to the success of Lam by providing best-in-class, innovative information system solutions and services. Together, we support users globally with data, information, and systems to achieve their business objectives.
## The impact you'll make
Our team at Lam is seeking a hands-on Observability Lead with a strong Site Reliability Engineering (SRE) and multi-cloud networking foundation to join our GIS Infrastructure Platform Engineering team. You will lead engineers in delivering robust observability frameworks, SLA/SLO/SLI disciplines, DR/BCP programs, backup and restore operations, and end-to-end network reliability across Azure, AWS, and GCP. You will own the full-stack delivery of observability, reliability, and resilience capabilities across a global multi-cloud enterprise.
## What you'll do
* Lead and grow a team delivering a world-class observability platform across global, multi-cloud production environments, including Azure, AWS, and GCP.
* Define and enforce SLA, SLO, and SLI frameworks across all infrastructure and network domains, driving continuous improvement through effective error budget management.
* Own end-to-end multi-cloud network observability, including VNet and VPC traffic flows, Transit Gateway routing, BGP peering health, and inter-region connectivity.
* Design and govern multi-cloud networking architectures, including Azure VNet, AWS VPC and Transit Gateway, GCP VPC, and hybrid connectivity solutions such as ExpressRoute, Direct Connect, and Cloud Interconnect.
* Design and implement agentic AI workflows using LLM-based agents, RAG patterns, and orchestration frameworks to enable AIOps-driven fault detection and remediation.
* Own disaster recovery (DR) and business continuity planning (BCP) strategy, including runbook authorship, multi-cloud failover validation, and periodic DR drills to ensure RTO and RPO commitments are met.
* Lead backup and restore operations across multi-cloud and hybrid environments, incorporating automated validation and cross-cloud recovery workflows.
* Build robust monitoring and alerting pipelines by integrating Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Azure Monitor, CloudWatch, and Google Cloud Operations into a unified observability stack.
* Drive automation-first practices through self-healing pipelines, remediation playbooks, and infrastructure-as-code (IaC) patterns to reduce toil and improve MTTR.
* Lead P1, P2, and P3 incident response efforts, including structured post-mortems and action tracking.
* Define and drive the multi-quarter roadmap for observability, reliability, networking, DR/BCP, and AI-assisted operations.
* Support hiring, performance management, and career development for the team.
## Who we're looking for
* A BS, MS, or PhD in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years of overall experience in Infrastructure, SRE, DevOps, or Network Engineering and 6+ years of experience leading high-performing SRE, Observability, or Platform Engineering teams.
* Proven expertise in defining, enforcing, and operating SLA, SLO, and SLI frameworks, including effective error budget management.
* Hands-on experience with disaster recovery (DR) and business continuity planning (BCP), including RTO/RPO planning, failover testing, and continuity documentation.
* Deep expertise in backup and restore operations across multi-cloud and hybrid environments.
* Strong multi-cloud networking skills across Azure (VNet, ExpressRoute, Virtual WAN), AWS (VPC, Transit Gateway, Direct Connect), and GCP (VPC, Cloud Interconnect, VPC-SC).
* Experience building and operating observability platforms, including tools such as Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Splunk, or equivalent solutions, with a focus on network telemetry and flow analysis.
* Deep expertise in automation, including Ansible, Terraform, Python, and self-healing infrastructure pipelines.
* Hands-on experience with infrastructure as code (IaC), CI/CD pipelines, Kubernetes (AKS, EKS, GKE), and all three major cloud platforms.
* Strong programming skills in Python or Go for tooling, automation, and system integrations.
* Experience leading P1, P2, and P3 incident management, including ITSM integration (ServiceNow preferred).
* Exceptional communication skills, with the ability to translate complex technical concepts into clear business value for engineering, product, and executive stakeholders.
## Preferred qualifications
* Experience with AIOps, including AI-assisted network fault detection, anomaly correlation, and auto-remediation.
* Familiarity with agentic AI workflows, including LLM-based agents and RAG patterns, applied to observability and operational use cases.
* Background in global WAN architectures, including MPLS and resilience strategies for multi-region enterprise environments.
* Experience with compliance-driven disaster recovery and business…