
Head of Platform Reliability and Observability

Remote / Online - Candidates ideally in
Boston, Suffolk County, Massachusetts, 02298, USA
Listing for: Geode Capital Management
Full Time, Remote/Work from Home position
Listed on 2026-03-29
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability

Geode Capital Management is a global systematic investment manager.

With a robust infrastructure and talented investment professionals, Geode offers clients the scale of a large asset management firm with the benefits of a versatile platform: flexibility and customization. Our firm is able to offer institutional investors the essential building blocks for today's changing investment landscape.

The talent we hire at Geode today helps us shape the future of investment management tomorrow, while allowing individuals to thrive and grow in a positive, dynamic environment. Geode seeks to enhance the employee experience with a culture of engagement, wellness, and inclusion.

Our headquarters are located in Boston’s financial district, the center of one of the world’s most vibrant finance and technology hubs.

Workplace Inclusion Statement

Geode recognizes the value that employees' individual differences can bring to the workplace. These differences may include attributes such as gender, race, ethnicity, sexual orientation or age, and may also include differences in styles of work, communications or thinking. The firm endeavors to create an inclusive work environment:

  • That supports the firm’s beliefs, values and business objectives
  • That fully leverages all employees' contributions to Geode's success
  • Where employees are treated with dignity and respect
  • That challenges employees to grow and develop professionally
  • That fosters innovation and creativity
  • That encourages employees to demonstrate initiative, individual responsibility and teamwork to achieve business goals

Position: Head of Platform Reliability & Observability
Location: Boston, MA
Job: 448
# of Openings: 1

Geode Capital Management, LLC is seeking a Head of Platform Reliability & Observability to lead the function responsible for the stability, resilience, performance, and operational transparency of our mission-critical platforms. This role owns the end-to-end reliability posture of production systems, spanning production support, incident management, infrastructure coordination, and observability strategy.

This is a senior leadership role with clear accountability for outcomes. You will lead and evolve the teams and practices that ensure issues are detected early, resolved quickly, and prevented from recurring. This role reports directly to the Chief Technology Officer and partners closely with engineering, infrastructure, and business stakeholders to continuously improve how our platforms operate at scale.

The ideal candidate brings a strong mix of technical depth, operational leadership, and people management, and is comfortable operating in a highly regulated, business-critical environment.

This is a hybrid role based in Boston, MA, with in-office days on Tuesdays, Wednesdays, and Thursdays and remote work available on Mondays and Fridays.

Responsibilities
  • Own the platform reliability and observability strategy across applications, data pipelines, and supporting infrastructure
  • Lead and develop teams, both onshore and offshore, responsible for production support (L1/L2), incident response, infrastructure troubleshooting, and 24/7 monitoring
  • Serve as the senior escalation point for high-severity production incidents, providing leadership, clarity, and calm during time-critical events
  • Establish and enforce standards for incident management, root cause analysis, post-incident reviews, and corrective action tracking
  • Partner with engineering to improve production readiness, release quality, and operational risk management
  • Drive the evolution of observability practices, including metrics, logs, alerts, dashboards, and service health indicators
  • Ensure monitoring and alerting are actionable, business relevant, and continuously improving, reducing noise and manual effort
  • Oversee Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs), partnering with development teams to prevent recurring issues
  • Analyze incident trends and operational data to identify systemic risks, recurring failure patterns, and automation opportunities
  • Champion automation, resilience, and reliability improvements that reduce toil and improve platform stability over time
  • Comm…