Sr. Site Reliability Engineer | Seattle area, Washington, USA | IT/Tech

*24-month contract with potential to convert*

Hybrid: onsite 4 days a week in Seattle, WA

Optomi, in partnership with our premier client, is seeking a highly skilled Site Reliability Engineer to support a growing portfolio of enterprise platforms, including AI-driven initiatives, automation services, and next-generation observability and data platforms. This role will focus heavily on Kubernetes-based infrastructure, platform reliability, automation, and operational scalability across complex distributed environments.

The ideal candidate will bring deep hands-on expertise in Kubernetes, infrastructure automation, and platform engineering while also serving as a technical leader capable of influencing engineering direction, reliability standards, and operational best practices across teams.

Key Responsibilities

Design, support, and optimize highly available cloud and containerized platform environments.
Lead operational reliability initiatives across distributed systems and Kubernetes-based infrastructure.
Implement and maintain monitoring, observability, telemetry, and alerting solutions for platform health and performance.
Drive capacity planning, performance optimization, and SLA/SLO reliability objectives.
Build and enhance infrastructure automation and Infrastructure-as-Code (IaC) frameworks.
Develop reusable CI/CD pipeline components, automation modules, and internal platform tooling.
Collaborate with software engineering, architecture, security, and infrastructure teams to improve scalability, resiliency, and deployment efficiency.
Troubleshoot complex infrastructure, networking, and application performance issues across large-scale environments.
Contribute to technical documentation, architecture standards, operational procedures, and engineering best practices.
Evaluate and leverage AI-assisted engineering and automation tools to improve development and operational workflows.

Required Qualifications

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or Software Engineering.
5+ years of systems administration experience supporting large-scale enterprise environments.
5+ years of experience automating infrastructure and operational processes.
3+ years of experience building developer-facing platforms, tooling, or internal engineering services.
Experience designing reusable infrastructure modules, libraries, templates, or shared services used across multiple teams.
Strong experience operating within large enterprise organizations supporting cross-functional engineering initiatives.
Excellent written communication skills with the ability to produce technical documentation, architecture proposals, and engineering guides.

Core Expertise

Deep hands-on expertise with Kubernetes and containerized infrastructure.
Strong understanding of distributed systems, cloud platforms, and infrastructure reliability engineering.
Extensive Linux administration, troubleshooting, and performance optimization experience.
Strong experience with Infrastructure-as-Code and automation tools such as Terraform, OpenTofu, Ansible, or similar technologies.
Experience implementing and managing CI/CD pipelines using platforms such as GitLab CI/CD, GitHub Actions, or equivalent tools.
Strong understanding of monitoring, observability, telemetry, and logging practices.

Additional Technical Skills

Experience with cloud-native technologies and container platforms.
Familiarity with Docker and container lifecycle management.
Solid understanding of networking fundamentals including HTTP, TLS, SSH, DNS, virtual networking, and load balancing.
Experience integrating security and compliance scanning tools into CI/CD workflows.
Experience deploying and managing infrastructure programmatically through APIs and SDKs.
Ability to implement instrumentation, monitoring, and telemetry across applications and infrastructure.
Familiarity with API design principles and developer platform integration patterns.
Strong troubleshooting and root cause analysis skills for complex system and performance issues.
