Site Reliability Engineer - Remote Job, Eden Prairie, Minnesota, USA, IT/Tech

Requisition number: 2358259

Job category: Technology

For those who want to invent the future of health care, here's your opportunity. We're going beyond basic care to health programs integrated across the entire continuum of care. Join us to start Caring. Connecting. Growing together.
Our Optum Serve IT (OSIT) team develops cutting-edge solutions that help people live healthier lives and help make the health system work better for everyone. From advanced data analytics and AI to cybersecurity, we use innovative approaches to solve some of healthcare's most complex challenges. To support this mission, OSIT has initiated a multi-year modernization program to update and enhance enterprise technology systems in line with modern design standards.

The Site Reliability Engineer will architect, develop, and maintain Optum Serve's cloud environment in both the commercial and government clouds. The role will work closely with software engineers, architects, and DevOps engineers to architect and maintain a secure, resilient, and high-performance cloud infrastructure.

You'll enjoy the flexibility to work remotely from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office a minimum of four days per week.

Primary Responsibilities:

+ Build, operate, and support IaaS and PaaS infrastructure in Azure and AWS commercial and government cloud environments under established architecture and standards

+ Partner with development teams to help define, track, and report on SLIs, SLOs, and SLAs

+ Contribute to the development and support of platform services, including provisioning, configuration, deployment, and day-to-day operations

+ Integrate applications and platforms with centralized logging, monitoring, metrics, and incident management systems

+ Configure and maintain observability tools (dashboards, APMs, alerts) to help engineering teams safely operate applications in production

+ Participate in an on-call rotation to support software and cloud infrastructure, following documented runbooks and escalation paths

+ Support root cause analysis efforts and assist with remediation by implementing automation, monitoring improvements, and reliability fixes

+ Maintain and enhance operational tooling, scripts, and frameworks used for platform and service support

+ Execute performance and resiliency testing for platform services using existing frameworks and tools

+ Configure and tune alerts related to performance, availability, cost, security, and compliance signals

+ Follow and help improve operational processes, contributing automation to reduce manual and repetitive support activities

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role, as well as provide development for other roles you may be interested in.

Required Qualifications:

+ 4+ years of experience working in a Site Reliability Engineering, Cloud Engineering, or DevOps role

+ Hands-on experience supporting Kubernetes (managed or bare metal) clusters in production environments

+ Some hands-on experience with monitoring and observability tools (e.g., Azure Monitor, Splunk, Dynatrace, Grafana, Prometheus)

+ Experience using Infrastructure as Code (IaC) tools such as Terraform or Pulumi

+ Experience supporting infrastructure and applications in production cloud environments

+ Experience interacting with or supporting systems that expose RESTful APIs

+ Solid working knowledge of at least one major cloud service provider (Azure preferred, AWS acceptable)

+ Working knowledge of networking fundamentals and common internet protocols

+ Understanding of identity and access management (IAM) concepts and best practices

+ Basic understanding of security concepts including encryption, PKI, and common application security risks (e.g., OWASP)

+ Familiarity with Kubernetes deployment and GitOps tools such as Helm, ArgoCD, or Flux

+ Familiarity with IDEs and source control tools such as Visual Studio Code, GitHub, or GitLab

+ Ability to…