
Manager of Reliability Operations

Remote / Online - Candidates ideally in
Southfield, Oakland County, Michigan, 48076, USA
Listing for: Nexcess
Full Time, Remote/Work from Home position
Listed on 2026-05-28
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 110,000 - 150,000 USD yearly
Job Description & How to Apply Below

This is a permanent, full-time, remote position.

US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.

We’re looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.

This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You’ll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.

What You’ll Do
  • Own Reliability Operations & Incident Command
    • Continuously evolve and improve incident management, change management, and post-incident practices
    • Establish clear standards for incident declaration, severity, escalation, and communication
    • Ensure consistent execution across teams and continuous process improvement
    • Own the incident command function, including roles, structure, and operating procedures
    • Lead or oversee major incident response in a 24/7 production environment
    • Build and manage on-call incident commander rotations with global coverage
  • Drive Learning, Accountability & Reliability Strategy
    • Own post-incident reviews, ensuring strong root cause analysis and clear documentation
    • Translate incident trends into actionable reliability improvements
    • Drive completion of corrective actions across teams; escalate when needed
    • Define and maintain service performance and reliability targets (availability, latency, error rates)
    • Own observability strategy, including monitoring, alerting, and signal quality
    • Improve detection, reduce time to resolution, and increase platform resilience
  • Operate Across a Complex Platform Environment
    • Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
    • Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
    • Ensure reliability practices scale across multiple products, brands, and customer environments
    • Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
    • Act as the central authority on reliability insights across teams
Qualifications / What You Bring
  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 7+ years of experience in systems operations, site reliability, or platform engineering
  • 2+ years of experience leading teams or major operational functions
  • Proven experience managing incidents in a 24/7 production environment
  • Strong background in troubleshooting, root cause analysis, and operational improvement
  • Experience with change management practices
Platform & Tooling Experience
  • Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
  • Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
  • Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
  • Logging and telemetry systems (centralized logging, metrics, tracing)
  • Ability to translate complex technical data into clear insights
  • Strong communication skills, especially in high-pressure situations
Nice to Have
  • Background in Computer Science, Engineering, or a related field
  • Experience in managed hosting, cloud infrastructure, or SaaS environments
  • Experience defining and tracking system reliability and performance targets
  • Familiarity with ITIL or similar operational frameworks
  • Exposure to VMware, Ceph, Linux, and Windows platforms
  • Relevant certifications (AWS, RHCE, etc.)
We Offer
  • Traditional and Roth 401k with company matching
  • A collaborative team culture
  • Consistent/set work hours
  • Challenging, non-repetitive daily duties
  • A voice in how things get done
Equal Employment Opportunity Policy

Liquid Web is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.
