Principal Site Reliability Engineer
Job in Olympia, Thurston County, Washington, 98507, USA
Listed on 2026-07-02
Listing for: Oracle
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Cybersecurity
Job Description & How to Apply Below
Designs and architects infrastructure and service to ensure reliability and functionality. Forecasts demands and responds to capacity needs. Collaborates with software development teams to develop reliable and scalable infrastructures. Exercises judgment when performing data collection to maintain and optimize operations and reliability. Leverages advanced knowledge to perform incident response and/or maintenance tasks. Provides comprehensive health and performance reporting. Identifies and recommends opportunities for automation.
Communicates comprehensive information about services and proactively anticipates and articulates the potential impact of changes. Provides comprehensive support for technology and documents incidents. Conducts advanced experiments with new tools and develops and maintains advanced knowledge of site reliability trends.
**Responsibilities**
**Key Responsibilities**
**Capacity Ingestion and Management:**
- Designs and architects infrastructure and/or service according to terms for reliability and functionality.
- Forecasts demands for infrastructure and responds to capacity needs, ensuring systems have sufficient resources to handle current and future workloads and identifying resource gaps.
- Collaborates with the software development team to develop infrastructures, ensuring features are reliable and scalable according to deployment requirements.
- Proactively identifies opportunities for prototyping and drives prototyping initiatives (e.g., testing new applications or infrastructures, assisting in onboarding) to explore novel approaches.
**Incident and Service Lifecycle Management:**
- Exercises judgment when performing data collection, triage, technical analysis, and redirection to maintain and optimize operations and infrastructure reliability.
- Takes proactive steps to monitor services, maintain up-to-date knowledge of their performance, and document their condition.
- Leverages advanced knowledge to perform incident response, root cause analyses, and/or maintenance on assigned services (e.g., software installs, version upgrades, security updates, backup and recovery).
- Provides comprehensive health and performance reporting and takes appropriate actions based on trends in data.
- May perform provisioning to support infrastructure, applications, and services.
- May experiment with new approaches to, and perform, decommissioning (e.g., shutting down servers, removing data from databases) to remove objects that are no longer needed.
**Automation:**
- Identifies and recommends opportunities for automation and assesses potential benefits to enhance operational efficiency.
- Develops and implements design, automation tools, or scripts to provide solutions, gather metrics, monitor, analyze, mitigate, or remediate issues/defects within infrastructures.
- Conducts testing on moderately complex automations to ensure they perform tasks correctly and produce expected results.
**Technical Communication and Guidance:**
- Writes release notes and/or communicates comprehensive information about the scale, capacity, security, performance attributes, and requirements of services and technology with customers and immediate and related teams.
- Proactively anticipates and articulates the potential impact of infrastructure, feature, and tool changes, considering their impact across team operations.
- Serves as a resource to team members on what information to communicate and how to communicate.
**Troubleshooting and Resolution:**
- Provides comprehensive operational support for technology, serving as a key escalation point for incidents and moderately complex issues arising within Oracle services.
- Drives and actively participates in on-call shifts to address issues.
- Executes the resolution of technical issues spanning multiple services, applying advanced investigation and debugging techniques to achieve SLOs (service level objectives).
- Documents incidents according to reporting methods and performs root cause analyses, capturing essential information for analysis and future reference.
- Performs post-mortem procedures to prevent incident reoccurrence.
**Innovation and Improvement:**
- Conducts advanced experiments and evaluations of cutting-edge tools and technologies to optimize infrastructure performance and reliability, taking proactive steps to adhere to security standards.
- Identifies and seeks opportunities to execute improvements for performance bottlenecks and deployments, ensuring efficient resource usage, speed, and scalability.
- Develops and maintains advanced knowledge of site reliability trends, sharing valuable insights and information with senior team members, management, and beyond to promote innovative building, testing, deploying, and running services.
- Performs moderately complex analyses and provides clear data on production to drive business development decisions (e.g., design changes).
**Core Responsibilities**
**Planning & Execution:**
- Manages and coordinates moderately complex tasks, monitoring timelines and…