Site Reliability Engineer (SRE) Job, Aurora, Colorado, USA, IT/Tech

Position: Site Reliability Engineer (SRE)

Digital Charter (DCIT) is searching for an experienced Site Reliability Engineer (SRE) for a full-time remote position. The ideal candidate will have a strong background in cloud infrastructure, software development, automation, and database reliability engineering.

This role involves operating at the system level across the full system lifecycle – from requirements and design through implementation and production operations – while ensuring platforms and data services are secure, scalable, and highly available.

This candidate will build automation to reduce operational toil, improve observability, and harden cloud and database environments through resilient architecture, performance testing, and disciplined incident management.

The starting salary for this position is $160,000.

* This is an ongoing positional requirement for multiple proposal efforts. Based on the Client’s final determination of requirements, if a role becomes available that matches your qualifications, a recruiter may reach out.

Requirements

Essential Duties and Responsibilities

Own reliability outcomes for cloud services and database platforms, operating at the system level across the full Software Development Lifecycle (SDLC).
Define and document service and data reliability requirements, including availability, latency, throughput, RPO/RTO, retention, and compliance constraints.
Translate requirements into technical architecture decisions.
Design and implement Infrastructure-as-Code (IaC) and automation for Amazon Web Services (AWS) and Microsoft Azure environments, including network, compute, Identity and Access Management (IAM), and platform services.
Engineer and operate containerized workloads using Kubernetes (including managed variants such as EKS and AKS), covering cluster operations, workload scheduling, scaling, and upgrades.
Build and manage CI/CD pipelines to enable safe, repeatable deployments.
Implement progressive delivery patterns and reliable rollback strategies.
Create and maintain robust monitoring, logging, and alerting solutions across cloud infrastructure, applications, and databases – emphasizing early detection of degradation (e.g., latency, error rate, saturation) rather than exclusively hard failures.
Establish and manage SLIs, SLOs, and error budgets for critical services and data platforms.
Drive continuous improvements aligned to reliability targets.
Lead incident response and participate in on-call rotations.
Perform systematic troubleshooting across distributed systems, networks, and data layers.
Conduct blameless post-incident reviews with corrective and preventative actions.
Design, implement, and continuously validate backup, restore, and disaster recovery strategies.
Automate runbooks and regularly test recovery procedures.
Evaluate and implement high availability and replication patterns; validate failover behavior and recovery objectives using measurable tests.
Develop and execute database design validation scenarios early in the SDLC to assess suitability of proposed design alternatives, generating sufficient performance data to support design decisions.
Partner with engineering teams on architectural designs.
Lead performance engineering efforts by creating baselines, conducting load and stress tests, diagnosing issues, and recommending changes to enhance operations.
Ensure security, compliance, and operational controls are built in via least-privilege access, secrets management, encryption in transit and at rest, auditing, configuration drift detection, and vulnerability and patch management for platform components.
Produce and maintain operational documentation (e.g., runbooks, standard operating procedures, architecture decision records) and mentor team members in reliability best practices.

Qualifications Required

3–7 years of relevant experience in Site Reliability Engineering, DevOps, Platform Engineering, or closely related roles, demonstrating skills typically gained from work across multiple projects and toolchains.
Proven experience operating at the system level across the full system lifecycle (i.e., requirements, design, implementation, and…

