Senior Site Reliability Engineer
Listed on 2026-03-05
IT/Tech
SRE/Site Reliability, Cloud Computing, Systems Engineer, IT Support
At BuildOps, we’re building a software platform that empowers today’s commercial contractors. From service management to project execution, we’re reimagining how our customers operate. Our team thrives on ambition, innovation, and collaboration – qualities we look for in every new hire.
You will join our cloud infrastructure and reliability engineering team as a Site Reliability Engineer (SRE). Your primary responsibility will be to improve and protect the reliability, performance, and operability of our production systems while helping evolve our AWS-based infrastructure. We’re looking for someone with a strong SRE mindset, solid software engineering fundamentals, and deep observability expertise who can work effectively in a distributed team environment.
Reporting to the DevOps and SRE Manager, this is a hands‑on role where you will influence reliability strategy, build tooling and automation, and contribute directly to day‑to‑day operations in a fast‑moving, industry‑defining company.
What You’ll Do
- Drive and refine modern SRE practices across services, including SLIs/SLOs, error budgets, and reliability reviews
- Design and maintain end‑to‑end observability (metrics, logs, traces, dashboards, and alerts) so teams can quickly detect, debug, and prevent issues
- Partner with product and engineering teams to design reliable services—reviewing architectures, failure modes, rollout strategies, and capacity/latency considerations
- Help evolve and operate our AWS infrastructure (networking, compute, data stores) using Infrastructure as Code (Terraform)
- Contribute code to services, tooling, and automation (for example, reliability libraries, deployment and incident tooling, health checks)
- Define, implement, and iterate on SLIs, SLOs, and error budgets with service owners, and use them to guide reliability work and release decisions
- Participate in incident response for infrastructure‑related production issues, including learning‑focused post‑incident reviews and follow‑through on action items
- Develop runbooks, safeguards, and automation that reduce manual work, improve time‑to‑diagnosis, and standardize responses to recurring scenarios
- Advocate for and implement security and compliance best practices in production environments
- Document standards, playbooks, and best practices so reliability improvements scale across teams
- Collaborate closely with software engineers, product managers, and other stakeholders to plan and deliver reliability‑focused initiatives
- 5+ years of professional experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or production‑focused Software Engineering, working on production systems and reliability‑focused initiatives
- Proven experience leading multi‑sprint, multi‑engineer projects (for example, reliability, performance, or infrastructure initiatives) to successful completion with clear business impact
- Defining and implementing SLIs/SLOs and error budgets
- Reducing toil through automation
- Safe deployment and rollout patterns
- Structured post‑incident reviews and continuous improvement
- Software engineering experience: you’ve written and maintained production‑quality code and can work comfortably in at least one modern language (for example, Python or Node.js/TypeScript)
- Strong interest in, and experience with, using LLMs and AI‑assisted tooling in your workflow, including the ability to validate and improve what they generate
- Designing metrics, logging, and tracing for multi‑service systems
- Building actionable dashboards and alerts with clear runbooks
- Correlating metrics, logs, and traces to debug complex issues
- Experience with tools such as Datadog, Prometheus, Grafana, Honeycomb, or New Relic (we use Datadog, but vendor‑agnostic experience is welcome)
- Experience working with AWS in production and with core platform primitives such as Terraform‑based Infrastructure as Code and container/orchestration platforms (for example, Docker with ECS, EKS, or Kubernetes)
- Participating in or…