Remote Principal Site Reliability Developer Job Santa Fe New Mexico USA,IT/Tech

Position: [Remote] Principal Site Reliability Developer (US Citizenship Required)
**Job Description**
**This role requires U.S. Citizenship and eligibility for a Federal Security Clearance**
**Our Team**
Building off our Cloud momentum, Oracle has formed a new organization: Oracle Health Data & Analytics Platform. This team will focus on product development and product strategy for Oracle Health, while building out a complete platform supporting modernized, automated healthcare. This is a net new line of business, constructed with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.

Oracle Health Data & Analytics Platform has a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.

You will have the opportunity to:

+ Reach billions of people with our products & services

+ Create technology that truly impacts the world

+ Have an immediate impact on developing technology

+ Find unlimited growth potential in inspiring work

+ Work with the best minds in the industry

+ Enjoy working in an open, diverse, and productive environment

**About The Job**
This role provides technical leadership for the core data platforms behind Oracle Health's Data & Analytics Platform. As a Principal Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.

You will lead the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm. These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.

**What You'll Do**
**Platform Ownership & Technical Leadership**
+ Own the end-to-end reliability, scalability, and operability of shared data platforms

+ Define platform standards, architectural direction, and operational guardrails

+ Influence cross-team technical decisions and long-term platform strategy

+ Drive long-term platform evolution and influence reliability strategy across the data ecosystem

**Architecture & Design**
+ Lead platform architecture and design reviews

+ Clearly articulate system behavior, dependencies, and failure modes

+ Make principled trade-offs between reliability, performance, cost, and complexity

+ Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively

**Operations Engineering**
+ Establish capacity models, scaling strategies, and operational best practices

+ Design platforms that behave predictably under load, failure, and change

+ Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery

**Distributed Systems Expertise**
+ Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical

+ Reason about failure modes such as back pressure, rebalancing, region movement, replication lag, and rolling upgrades

**Security**
+ Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication

+ Treat security as a first-class architectural concern

**Automation**
+ Design and evolve an Ansible- and Terraform-driven automation framework

+ Treat automation as production software: versioned, reviewed, tested, and improved

+ Eliminate operational toil by encoding reliability and safety into the platform

**Incident Leadership & Prevention**
+ Serve as the ultimate escalation point for complex or ambiguous incidents

+ Focus on eliminating entire classes of failure, not just resolving individual issues

**Representation**
+ Represent SRE and platform engineering in high-visibility and sensitive forums

+ Communicate clearly with engineering leadership and partner teams

**Responsibilities**
The team operates within the Oracle Health Data & Analytics Platform, supporting one of Oracle Health's core products, HealtheIntent. We operate the big data and streaming infrastructure that enables downstream teams to deliver reliable customer-facing solutions at scale, while continuously improving operability and efficiency.

**Required Experience**
+ 8+ years operating large-scale, customer-facing distributed platforms

+ Deep experience with HDFS, YARN, HBase, Kafka, Storm, or similar systems

+ Strong background in Linux, networking, and distributed system troubleshooting

+ Infrastructure-as-Code using Ansible and Terraform

+ Scripting and automation using Python, Ruby, and Bash

+ Hands-on experience operating Kerberized environments

+ Proven ability to define and document technical architecture for complex systems

+ Demonstrated ownership of shared platforms with broad blast radius and multiple downstream consumers

+ Experience designing observability and capacity…