Remote Principal Site Reliability Developer - USC Job, Ann Arbor, Michigan, USA, IT/Tech

Position: [Remote] Principal Site Reliability Developer - USC Required

Overview

Come and join us! Building on our cloud momentum, Oracle has formed a new organization: Oracle Health. This team focuses on product deployment, sustainability, troubleshooting, and product strategy while building a modern, automated healthcare platform. This is a net-new line of business with an entrepreneurial spirit, offering a unique opportunity to help build a world-class engineering organization centered on excellence, innovation, and real-world impact.

As a Site Reliability DevOps Engineer, you will play a critical role in operating and scaling a Clinical AI Assistant platform used by healthcare professionals worldwide. This system is designed to improve the quality, safety, and efficiency of care delivery for billions of patients globally. Your work will directly influence the reliability and performance of AI-driven systems that clinicians depend on in high-stakes environments.

This role goes beyond traditional SRE responsibilities—you will have the opportunity to leverage AI/ML techniques and develop AIOps solutions to proactively manage system reliability, detect anomalies, automate remediation, and continuously improve service performance. You will help define how reliability engineering evolves in the context of intelligent, AI-powered healthcare systems.

You will be responsible for architecture, production operations, capacity planning, performance management, deployment, and release engineering, working across cross-functional teams to deliver highly reliable, scalable, and secure services.

Responsibilities

Own the architecture, design, implementation, and production operations of core platform and AI-driven system services
Ensure the reliability, availability, and performance of the Clinical AI Assistant platform used in real-world healthcare settings
Build and operate AIOps-driven capabilities (e.g., intelligent alerting, anomaly detection, automated remediation, predictive scaling)
Continuously improve systems through automation, self-healing mechanisms, and real-time observability
Design and develop software to enhance system scalability, efficiency, and resilience
Partner with cross-functional teams to prototype and deliver new platform services
Lead efforts in capacity planning, demand forecasting, performance tuning, and cost optimization
Solve complex distributed systems challenges in cloud-native environments and prevent recurrence through engineering rigor
Contribute to platform engineering best practices, including infrastructure as code, CI/CD, and service reliability standards
Stay current with emerging technologies in cloud, distributed systems, and AI/ML-driven operations

Key Requirements / Experience

Must-have:

Ability to obtain and maintain a federal security clearance (US citizenship required)
8+ years of experience in Site Reliability Engineering, DevOps, or related roles
Proven experience operating large-scale, distributed, production systems with high availability requirements
Strong experience with container orchestration (Kubernetes, Docker, or similar)
Infrastructure as Code expertise (Terraform, Ansible, Chef, Puppet, Packer, etc.)
Experience building and operating CI/CD pipelines (Git, Jenkins, GitLab, Rundeck, etc.)
Proficiency in scripting and automation (Bash, Python, PowerShell, etc.)
Experience with at least one major cloud provider (OCI, AWS, Azure, etc.)
Strong Linux systems expertise
Experience with observability tooling (monitoring, logging, tracing) and performance optimization

Nice-to-have:

Experience supporting or operating AI/ML or LLM-based systems in production
Exposure to AIOps, intelligent automation, or ML-driven observability
Experience in healthcare or other regulated environments (HIPAA, security, compliance)
Background in high-throughput, low-latency systems supporting mission-critical workloads
Software engineering experience in Java, Python, C++, or similar languages

Benefits

US:
Hiring Range in USD from: $86,400 to $199,500 per annum. May be eligible for bonus and equity.

Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect…