Principal Site Reliability Engineer Job, Olympia area, Washington, USA, IT/Tech

**Job Description**

Designs and architects infrastructure and service to ensure reliability and functionality. Forecasts demands and responds to capacity needs. Collaborates with software development teams to develop reliable and scalable infrastructures. Exercises judgment when performing data collection to maintain and optimize operations and reliability. Leverages advanced knowledge to perform incident response and/or maintenance tasks. Provides comprehensive health and performance reporting. Identifies and recommends opportunities for automation.

Communicates comprehensive information about services and proactively anticipates and articulates the potential impact of changes. Provides comprehensive support for technology and documents incidents. Conducts advanced experiments with new tools and develops and maintains advanced knowledge of site reliability trends.

**Responsibilities**

**Key Responsibilities**

**Capacity Ingestion and Management:**

- Designs and architects infrastructure and/or service according to terms for reliability and functionality.
- Forecasts demands for infrastructure and responds to capacity needs, ensuring systems have sufficient resources to handle current and future workloads and identifying resource gaps.
- Collaborates with the software development team to develop infrastructures, ensuring features are reliable and scalable according to deployment requirements.
- Proactively identifies opportunities for prototyping and drives prototyping initiatives (e.g., testing new applications or infrastructures, assisting in onboarding) to explore novel approaches.

**Incident and Service Lifecycle Management:**

- Exercises judgment when performing data collection, triage, technical analysis, and redirection to maintain and optimize operations and infrastructure reliability.
- Takes proactive steps to monitor services, maintain up-to-date knowledge of their performance, and document their condition.
- Leverages advanced knowledge to perform incident response, root cause analyses, and/or maintenance on assigned services (e.g., software installs, version upgrades, security updates, backup and recovery).
- Provides comprehensive health and performance reporting and takes appropriate actions based on trends in data.
- May perform provisioning to support infrastructure, applications, and services.
- May experiment with new approaches to decommissioning and perform decommissioning (e.g., shutting down servers, removing data from databases) to remove objects that are no longer needed.

**Automation:**

- Identifies and recommends opportunities for automation and assesses potential benefits to enhance operational efficiency.
- Develops and implements design, automation tools, or scripts to provide solutions, gather metrics, monitor, analyze, mitigate, or remediate issues/defects within infrastructures.
- Conducts testing on moderately complex automations to ensure they perform tasks correctly and produce expected results.

**Technical Communication and Guidance:**

- Writes release notes and/or communicates comprehensive information about the scale, capacity, security, performance attributes, and requirements of services and technology with customers and immediate and related teams.
- Proactively anticipates and articulates the potential impact of infrastructure, feature, and tool changes, considering their impact across team operations.
- Serves as a resource to team members on what information to communicate and how to communicate.

**Troubleshooting and Resolution:**

- Provides comprehensive operational support for technology, serving as a key escalation point for incidents and moderately complex issues arising within Oracle services.
- Drives and actively participates in on-call shifts to address issues.
- Executes the resolution of technical issues spanning multiple services, applying advanced investigation and debugging techniques to achieve SLOs (service level objectives).
- Documents incidents according to reporting methods and performs root cause analyses, capturing essential information for analysis and future reference.
- Performs post-mortem procedures to prevent incident reoccurrence.

**Innovation and Improvement:**

- Conducts advanced experiments and evaluations of cutting-edge tools and technologies to optimize infrastructure performance and reliability, taking proactive steps to adhere to security standards.
- Identifies and seeks opportunities to execute improvements for performance bottlenecks and deployments, ensuring efficient resource usage, speed, and scalability.
- Develops and maintains advanced knowledge of site reliability trends, sharing valuable insights and information with senior team members, management, and beyond to promote innovative building, testing, deploying, and running services.
- Performs moderately complex analyses and provides clear data on production to drive business development decisions (e.g., design changes).

**Core Responsibilities**

**Planning & Execution:**

- Manages and coordinates moderately complex tasks, monitoring timelines and…