Sr. Site Reliability Engineer, Factory Infrastructure & Systems

Job in Normal, McLean County, Illinois, 61761, USA
Listing for: Rivian
Full Time position
Listed on 2025-12-28
Job specializations:
  • IT/Tech
    Systems Engineer, IT Support, Cloud Computing, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 100000 - 125000 USD Yearly
Job Description & How to Apply Below
Position: Sr. Staff Site Reliability Engineer, Factory Infrastructure & Systems

About Rivian

Rivian is on a mission to keep the world adventurous forever. This goes for the emissions‑free Electric Adventure Vehicles we build, and the curious, courageous souls we seek to attract.

As a company, we constantly challenge what’s possible, never simply accepting what has always been done. We reframe old problems, seek new solutions and operate comfortably in areas that are unknown. Our backgrounds are diverse, but our team shares a love of the outdoors and a desire to protect it for future generations.

Role Summary

This Site Reliability Engineer (SRE) role owns reliability outcomes for factory digital systems spanning compute, network, and application layers. The work is split across Platform Engineering, Observability, and Tiger Team incident response. This position will be located in Normal, IL and report to our Sr. Manager, Software Infrastructure/DevOps.

Responsibilities

Platform Engineering
  • Design and evolve reliable, scalable, and secure platform foundations across hybrid/on‑prem factory environments (e.g., Kubernetes/EKS, vSphere/ESXi, Linux/Windows server, industrial PCs), with clear reliability and cost guardrails.
  • Codify production‑readiness standards and guardrails for factory systems (health checks, runbooks, SLOs/SLIs, deployment safety, failover patterns) aligned to Platform’s production readiness checklist.
  • Advance Infrastructure‑as‑Code and configuration automation (e.g., Terraform/Terragrunt, Ansible) for factory workloads, including provisioning, secrets, policies, and change safety.
  • Partner with Manufacturing Engineering, Factory IT, Security, and Networking to land pragmatic, operable designs; contribute to reference architectures and reusable patterns.
  • Lead or contribute to reliability initiatives (e.g., self‑healing automation, safe rollouts/canaries, rollback strategies) appropriate to level.
Observability
  • Raise the bar on end‑to‑end telemetry for factory systems: high‑signal metrics, logs, traces, and SLO‑driven alerts (e.g., Prometheus/Grafana, Loki/Tempo, Datadog, Splunk).
  • Establish consistent dashboards and service health views for shop/line‑level systems, including exporters for hypervisor/VM health and plant endpoints where feasible (e.g., vSphere exporters).
  • Improve alert quality and ownership: reduce noise, align escalation policies, and ensure actionable runbooks and health checks for critical services.
  • Build internal tooling (CLI/SDKs, operators/controllers, remediation bots) that turns telemetry into prevention and rapid response.
Tiger Team / Incident Response
  • Act as technical incident responder for factory‑impacting events; lead fast triage and stabilize services.
  • Drive post‑incident reviews that eliminate repeat failure modes; improve MTTR and availability through durable engineering fixes and process improvements.
  • Drill on‑call readiness, escalation policies, and schedules using established incident tooling and practices (e.g., Rootly/alternatives), tuned for 24x7 manufacturing operations.
  • Mentor peers through reliability deep dives, failover exercises, and simulation runbooks (breadth of mentorship scales with level).

Qualifications

  • Production experience in SRE/Platform/DevOps or Operations, owning availability, performance, and cost for critical services.
  • Strength in several of:
    Kubernetes/EKS and container networking; AWS primitives for resilient platforms; vSphere/ESXi and virtualization;
    Linux administration (with working knowledge of Windows Server); service discovery, load balancing, and DNS.
  • Observability across metrics/logs/traces, SLO/error‑budget practice, and alert hygiene with tools like Prometheus/Grafana, Loki/Tempo, Datadog, Splunk.
  • Production change safety:
    GitOps, progressive delivery, guardrails in CI/CD (GitLab preferred), automated rollbacks, and policy‑as‑code.
  • Infrastructure automation:
    Terraform/Terragrunt, Ansible, scripting (Python/Bash), secrets management, and least‑privilege patterns.
  • Incident leadership/participation in 24x7 environments; clear comms under pressure and a habit of converting learnings into durable fixes.
  • Ability to partner across Factory IT, Manufacturing Engineering, Security, Networking, and application teams; communicate…