Site Reliability Engineer
Listed on 2026-01-10
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, Cybersecurity
Job Description
Position Summary
We are seeking a highly experienced Site Reliability Engineer (SRE) to help design, operate, and continuously improve a highly available, secure, and cost-efficient multi-cloud platform. This role is deeply hands‑on and execution‑focused, with a strong bias toward action, measurable outcomes, and incremental progress.
The ideal candidate has deep expertise in AWS and Azure, strong instincts for efficient architecture, and a passion for observability, automation, and cost optimization (FinOps). You will work in a high‑compliance environment, partnering closely with engineering, product, and operations teams to ensure platform health, reliability, and scalability.
This role also requires leadership through influence—guiding teams, setting technical direction, and raising operational maturity—while remaining directly engaged in delivery.
Essential Duties and Responsibilities
Reliability & Platform Health
- Design, implement, and operate reliable, scalable, and resilient systems across AWS and Azure
- Establish and improve SLOs, SLIs, error budgets, and incident response practices
- Lead root cause analysis and drive corrective actions to prevent recurrence
- Continuously improve platform uptime, performance, and operational maturity
Observability & Operational Excellence
- Design and maintain best‑in‑class observability (metrics, logs, traces, alerting)
- Ensure actionable alerts with low noise and high signal
- Use data to identify reliability risks, performance bottlenecks, and efficiency opportunities
- Drive incremental improvements using clear goals and measurable outcomes
FinOps & Cost Optimization (Top Priority)
- Own and drive cloud cost optimization initiatives across AWS and Azure
- Partner with engineering and leadership to align cost with business value
- Implement cost visibility, forecasting, and accountability practices
- Identify architectural and operational improvements that reduce waste without sacrificing reliability or security
Security & Compliance
- Operate within highly regulated environments with strong security controls
- Support compliance efforts (SOC 2, CJIS preferred, NIST‑aligned practices)
- Embed reliability and compliance requirements into platform design and operations
- Partner with security teams to ensure secure‑by‑default systems
Automation & AI‑Enabled Efficiency
- Automate operational workflows using Infrastructure as Code, CI/CD, and tooling
- Leverage AI tools for analysis, incident investigation, cost insights, capacity planning, and operational efficiency
- Continuously seek opportunities to move faster and smarter through automation and intelligent tooling
Leadership & Collaboration
- Act as a technical leader and trusted partner to engineering teams
- Guide and mentor others on reliability, observability, and cost‑efficient design
- Influence architecture and operational decisions through data and collaboration
- Drive initiatives end‑to‑end with accountability and ownership
Required Qualifications
- Bachelor's Degree in Computer Science or Engineering
- 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering
- Deep hands‑on experience operating production systems in AWS and Azure
- Strong background in cloud architecture, reliability engineering, and automation
- Proven experience with observability platforms (metrics, logging, tracing, alerting)
- Demonstrated success driving cloud cost optimization / FinOps initiatives
- Experience working in high‑compliance environments (SOC 2 required; CJIS a plus)
- Strong scripting or programming skills (e.g., Python, Go, Bash, or similar)
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation, ARM/Bicep)
Preferred Qualifications
- CJIS compliance experience
- Experience supporting SaaS platforms serving public sector or regulated customers
- Exposure to multi‑region, high‑availability architectures
- Experience implementing or maturing FinOps practices at scale
- Prior experience mentoring or leading cross‑functional technical initiatives
What Success Looks Like
- Platform reliability and performance improve measurably over time
- Cloud costs are visible, controlled, and optimized without compromising outcomes
- Incidents are fewer,…