Principal Site Reliability Engineer, Infrastructure Observability

Job in College Park, Prince George's County, Maryland, 20741, USA
Listing for: T. Rowe Price
Full Time position
Listed on 2026-07-03
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability, IT Project Manager
Salary/Wage Range or Industry Benchmark: 120,000 - 160,000 USD yearly
Job Description & How to Apply Below

Role Summary

In this role as Principal Site Reliability Engineer, Infrastructure Observability, you will help form, develop, and establish a team of Site Reliability Engineers (SREs) focused on the observability, sustainability, scalability, measurability, and recoverability of T. Rowe Price’s innovative cloud and on-prem solutions by leveraging automation and best-of-breed tools. The successful candidate will have a strong operations and engineering background, be hands‑on when needed, and have expertise in cloud environments (public and private), infrastructure operations, DevOps practices, CI/CD toolchains and systems, code build and deployment, incident response, and 24x7 monitoring and support.

The candidate will also have extensive experience operating in an SRE function within a complex, distributed environment, and a demonstrated ability to work horizontally and vertically within an organization with diverse partners and sponsor groups.

Responsibilities
  • Possesses extensive knowledge in own area of expertise and extensive in-depth knowledge of the broader portfolio for comprehensive understanding of up/downstream impacts across technology infrastructure
  • Designs technology solutions to prevent or minimize service disruptions
  • Prevents technology service disruptions through technology solution recommendations and automations
  • Fosters a culture of deep learning through blameless post‑mortems to improve the shared goal of reliability across services
  • Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and by driving strategic growth in this area within Global Technology
  • Analyzes incidents impacting technology availability for high‑level trends across the broad portfolio
  • Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment
  • Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk
  • Demonstrates outstanding awareness of the complexities of the tech and asset management industries
  • May lead initiatives of varying degrees of complexity that span multi‑functional areas
  • Contributes to definition of target state architecture and design of the technology environment

Qualifications
  • Bachelor's degree or an equivalent combination of education and relevant experience, and 10+ years of experience designing and operating cloud infrastructure with senior‑level impact
  • 5+ years of experience building and supporting solutions in Amazon Web Services (AWS)
  • 5+ years of experience building and running a DevOps and/or SRE function
  • Experience implementing and operating chaos engineering practices at scale
  • Strategic and program‑level implementation experience
  • Demonstrable experience implementing new technology, tools, and platforms
  • System administration and scripting experience
  • Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents
  • Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core, etc.)
  • Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)
  • Proficiency with defining, right‑sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability
  • Experience with implementing and managing Error Budgets
  • Proficiency with understanding and explaining incident situations and their recovery plans to prevent recurrence
  • Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application‑specific logging
  • Knowledge/experience with observability tools such as New Relic, SolarWinds DPA, Elastic Stack, Prometheus, Grafana, Splunk, and cloud‑native tools
  • Knowledge/experience with cloud management tools such as Ansible, Terraform, Vault, and Vagrant
  • Works independently, with guidance in only the most complex situations
  • Makes sound decisions with limited facts or resources
  • Balances strategic and pragmatic concerns when solving problems
  • Adjusts…