
Senior Site Reliability Engineer, AI/ML

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: Intuitive
Full Time position
Listed on 2025-12-30
Job specializations:
  • IT/Tech
    Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 178800 USD yearly
Job Description & How to Apply Below
  • Ways of Working:
    Onsite - This job is fully onsite.
  • Employee Type:
    Employee
  • Min. Salary Region 1: 178800 USD
  • Global Job Level (HCM):
    Professional 4 (11)
  • Min. Salary Region 2: 151900 USD
Company Description

It started with a simple idea: what if surgery could be less invasive and recovery less painful? Nearly 30 years later, that question still fuels everything we do at Intuitive. As a global leader in robotic-assisted surgery and minimally invasive care, our technologies, like the da Vinci surgical system and Ion, have transformed how care is delivered for millions of patients worldwide.

We’re a team of engineers, clinicians, and innovators united by one purpose: to make surgery smarter, safer, and more human. Every day, our work helps care teams perform with greater precision and patients recover faster, improving outcomes around the world.

The problems we solve demand creativity, rigor, and collaboration. The work is challenging, but deeply meaningful—because every improvement we make has the potential to change a life.

If you’re ready to contribute to something bigger than yourself and help transform the future of healthcare, you’ll find your purpose here.

Job Description

We are seeking a highly skilled Senior Site Reliability Engineer to join our Technical Operations team and lead reliability, scalability, and performance initiatives for AI/ML workloads across multi-cloud and on-prem environments. This role will focus on building and maintaining resilient infrastructure for advanced data science workflows, including NVIDIA DGX systems, leveraging platforms such as Domino Data Lab, Slurm, and NVIDIA Base Command, while driving automation, observability, and networking optimization.

Key Responsibilities

  • Contribute to the deployment and maintenance of infrastructure across AWS, GCP, and Azure, as well as on-prem NVIDIA DGX systems.
  • Implement and manage Infrastructure as Code (IaC) using Terraform and Ansible for automated provisioning and configuration.
  • Support cloud and on-prem networking solutions for secure, high-performance connectivity.
  • Manage and optimize Domino Data Lab workflows and Slurm clusters for distributed training and inference.
  • Integrate and support NVIDIA Base Command for GPU-based compute environments.
  • Develop automation scripts and tools in Python to streamline operations and improve reliability.
  • Support CI/CD pipelines using GitLab, ensuring smooth deployments to UAT and production environments.
  • Implement and maintain observability solutions (monitoring, logging, alerting) using tools like Prometheus, Grafana, and cloud-native services.
  • Deploy and manage Kubernetes clusters (EKS, GKE) for scalable containerized workloads.
  • Troubleshoot complex workflows and ensure high availability of critical systems.
  • Collaborate with data science and engineering teams to optimize resource utilization and workflow efficiency.
  • Drive best practices for incident response, capacity planning, and system reliability in multi-cloud and HPC environments.

Additional Responsibilities

  • Administer and optimize ITSM platforms (e.g., Jira Service Management, ServiceNow) for release/change/incident workflows.
  • Support tooling across CI/CD, monitoring, and ticketing systems to ensure traceability and automation.
  • Maintain documentation and evidence for audits related to release/change/incident processes.
  • Partner with Compliance and InfoSec teams to ensure controls meet HIPAA, HITRUST, FDA GxP, and ISO 27001 standards.
  • Act as the primary liaison between engineering, product, support, and compliance teams for operational readiness.
  • Facilitate regular status updates, incident reviews, RCAs, and change planning sessions with stakeholders.
  • Support updates to onboarding materials and training sessions for engineers and product managers on release/change/incident protocols.
  • Promote a culture of ownership and reliability through education and process transparency.
  • Support retrospectives for major releases and incidents to identify process gaps and improvement opportunities.
  • Track and report on KPIs such as change success rate, incident recurrence, and release velocity.
  • Identify operational risks and escalate them proactively to leadership.
  • Maintain…
Position Requirements
10+ years of work experience