Senior Site Reliability Engineer

Job in Miami, Miami-Dade County, Florida, 33222, USA
Listing for: Iru
Full Time position
Listed on 2026-05-26
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, IT Support, Cloud Computing
Salary/Wage Range or Industry Benchmark: 120,000 - 150,000 USD per year
Job Description & How to Apply Below

The Opportunity

We are looking for a Senior SRE to own how we detect, respond to, and learn from incidents, and to drive consistent observability across services and teams. This role sits at the intersection of reliability engineering and cross-team enablement—you will work alongside our Infrastructure team to complement their platform-building work with a sharp focus on operational excellence and measurable reliability. You will partner with engineering and platform teams to reduce MTTD and MTTR, and to make reliability measurable, repeatable, and ultimately team-owned.

What You Will Do
  • Lead and refine the incident lifecycle: detection, triage, communication, mitigation, resolution, and post-incident review.
  • Define and maintain severity models, escalation paths, on-call expectations, and runbooks/playbooks—keeping them current and usable under pressure.
  • Facilitate blameless postmortems; turn findings into tracked remediations and shared learning that reduces repeat incidents.
  • Improve coordination during major incidents: roles, tooling, customer/stakeholder updates, and handoffs.
  • Partner with security, support, and product on incident communications and regulatory or contractual obligations where applicable.
Observability Standardization & SLI/SLO Evangelism
  • Establish and maintain organization-wide standards for metrics, logs, and traces in Datadog—including naming conventions, cardinality, retention, and sampling—so teams can instrument consistently and confidently.
  • Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams. Meet teams where they are: bootstrap SLI/SLO programs for teams starting from scratch and improve rigor for teams that already have them, with the long-term goal of teams owning their own observability.
  • Build and maintain reusable Datadog dashboard templates, monitor templates, and alerting patterns that teams can adopt and adapt—reducing the activation energy for doing observability well.
  • Champion golden signals and RED/USE-style alerting philosophies; align alerts with user-impacting symptoms, not just low-level infrastructure noise.
  • Partner with the Infrastructure team on observability stack decisions, multi-tenancy, cost controls, and data lifecycle.
  • Continuously reduce alert noise through threshold tuning, ownership assignment, and on-call load management.
Reliability Culture
  • Mentor engineers on operational excellence, safe deployment practices, and production readiness; help engineering teams grow their own reliability instincts.
  • Contribute to capacity planning, chaos/game-day exercises, and reliability reviews for critical changes.
  • Serve as a connective layer between the SRE and Infrastructure teams—aligning on tooling, standards, and shared goals.
Requirements
  • Experience: 5+ years in SRE, production engineering, or equivalent, including on-call responsibility for customer-facing systems.
  • Incidents: Proven experience running or significantly improving incident response (process, tooling, or both) in a distributed systems environment.
  • Observability: Deep, hands-on experience with Datadog—building dashboards, monitors, and instrumentation standards across multiple teams or services. Experience with metrics, logging, and tracing at scale.
  • SLI/SLO Programs: Demonstrated experience defining SLOs/SLIs and error budget policies in production; comfortable working with teams to codify the metrics their reliability posture is based on.
  • Systems: Strong understanding of Linux, networking, distributed systems failure modes, and cloud or hybrid infrastructure (Kubernetes, load balancers, databases, queues).
  • Automation: Proficiency in at least one of Go, Python, or similar for tooling and automation; comfort with IaC concepts (Terraform or equivalent).
  • Communication: Clear written and verbal communication; ability to facilitate discussions during high-pressure incidents and deliberate postmortems alike.
  • Collaboration: Track record of influencing without direct authority and driving adoption across engineering teams.
Nice to Have
  • Experience with OpenTelemetry or similar vendor-neutral instrumentation strategies.
  • Familiarity with PagerDuty, Incident.io, Opsgenie, or similar.
Position Requirements
10+ years of work experience