Senior Manager, Site Reliability Engineering (SRE)

Job in Coos Bay, Coos County, Oregon, 97458, USA
Listing for: GHX
Full Time position
Listed on 2025-12-19
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below
Position: Senior Manager, Site Reliability Engineering (SRE)

The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization to deliver reliable, scalable, and resilient platforms and services. This role will own the strategy, implementation, and continuous improvement of a unified observability platform that provides end-to-end visibility into infrastructure, applications, APM, and databases, enabling proactive issue detection, faster incident resolution, and improved customer experience.

The Sr. Manager will drive practices around SLIs, SLOs, SLAs, and Error Budgets, embedding reliability into engineering culture. They will oversee incident management, RCA, proactive alerting, predictive analysis, and automation, while ensuring close collaboration with engineering, product, and platform teams.

Key Responsibilities
Leadership & Team Management
  • Hire, lead, and mentor a high-performing SRE team across geographies.
  • Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.
  • Establish a healthy 24x7 on-call model, ensuring coverage while promoting team well-being.
  • Drive a blameless culture through structured postmortems and RCA follow-up actions.
Unified Observability & Monitoring
  • Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.
  • Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end-user experience.
  • Implement APM (Application Performance Monitoring) to trace performance across distributed systems.
  • Establish dashboards, metrics, and proactive alerting to identify anomalies early.
  • Drive adoption of AIOps and predictive analytics for proactive reliability improvements.
Reliability Engineering
  • Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.
  • Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets.
  • Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) through automation, faster detection, and better instrumentation.
  • Perform capacity planning, scalability reviews, and resiliency testing.
Incident & Problem Management
  • Lead major incident response, coordinating communications with executives and stakeholders.
  • Drive root cause analysis (RCA) and implement long-term fixes.
  • Partner with ITSM teams to align with incident, problem, and change management processes.
  • Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices.
Collaboration & Cross-Functional Work
  • Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.
  • Provide guidance on instrumentation, reliability design, and operational readiness for new services.
  • Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness.
  • Champion reliability as a shared responsibility across development and operations.
Qualifications & Experience
Required
  • 12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles.
  • Proven expertise in unified observability, monitoring, and alerting across infra, apps, APM, and databases.
  • Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds.
  • Hands‑on with incident response, RCA, MTTR/MTTD reduction, and on‑call management.
  • Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.
  • Strong AWS experience (EC2, ECS, EKS, networking, scaling groups).
  • Hands‑on with containers & orchestration (Docker, Kubernetes).
  • Proficiency in Python, Java, C#, and shell scripting for automation.
  • Knowledge of networking fundamentals, distributed systems, and high‑availability architectures.
  • Familiarity with ITIL/ITSM processes (incident, problem, change).
  • Strong leadership, stakeholder management, and communication skills.
Preferred
  • Experience in large‑scale SaaS or product‑driven environments.
  • Hands‑on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.
  • Experience with chaos engineering, resiliency testing, and disaster recovery planning.
  • Certifications: AWS…
Position Requirements
10+ Years work experience