Senior Product Manager - Observability and Resilience
Job in Santa Clara, Santa Clara County, California, 95053, USA
Listed on 2025-11-27
Listing for: NVIDIA Corporation
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, IT Support, Cybersecurity
Job Description & How to Apply Below
This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.
**What you will be doing:**
* Be a subject‑matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, along with the telemetry signals that reveal them, and how they correlate to workload health and SLOs. Master modern reliability architectures. Keep up to date with industry trends.
* Build for everyone who wants to use these tools. Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners.
* Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
* Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
* Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.
**What we need to see:**
* BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology.
* Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems.
* Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools.
* Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch.
* Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC2, FedRAMP).
* Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences.
* Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models.
* Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes.
**Ways to stand out from the crowd:**
* Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing.
* Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data‑center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification.
* Startup or 0‑to‑1 experience building cloud‑native observability or resilience tools; proven success bringing open‑source observability products to market and shaping GTM strategy.
* Familiarity with MLOps tool chains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud.
* Expertise with containerization technologies like Docker and Kubernetes, plus virtualization. Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE).
We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our elite engineering teams are growing fast. NVIDIA is widely considered to be one of the industry's most desirable employers.
NVIDIA is at the center…
Position Requirements: 10+ years of work experience