Principal Site Reliability Engineer, Infrastructure Observability
Listed on 2026-07-03
IT/Tech
Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability, IT Project Manager
Role Summary
In this role as Principal Site Reliability Engineer, Infrastructure Observability, you will help establish and develop a team of Site Reliability Engineers (SREs) focused on the observability, sustainability, scalability, measurability, and recoverability of T. Rowe Price’s innovative cloud and on-prem solutions by leveraging automation and best-of-breed tools. The successful candidate will have a strong operations and engineering background, be hands‑on when needed, and have expertise in cloud environments (public and private), infrastructure operations, DevOps practices, CI/CD toolchains and systems, code build and deployment, incident response, and 24x7 monitoring and support.
The candidate will also have extensive experience operating in an SRE function within a complex, distributed environment. They will have a demonstrated ability to work horizontally and vertically across an organization with diverse partners and sponsor groups.
Responsibilities
- Possesses extensive knowledge of own area of expertise and in-depth knowledge of the broader portfolio, for a comprehensive understanding of upstream and downstream impacts across technology infrastructure
- Designs technology solutions to prevent or minimize service disruptions
- Prevents technology service disruptions through technology solution recommendations and automations
- Fosters a culture of deep learning through blameless post‑mortems to improve the shared goal of reliability across services
- Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and driving strategic growth in this area within Global Technology
- Analyzes incidents impacting technology availability for high‑level trends across the broad portfolio
- Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment
- Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk
- Demonstrates outstanding awareness of the complexities of the tech and asset management industries
- May lead initiatives of varying degrees of complexity that span multiple functional areas
- Contributes to definition of target state architecture and design of the technology environment
- Bachelor's degree or the equivalent combination of education and relevant experience AND 10+ years of experience designing and operating cloud infrastructure with senior‑level impact.
- 5+ years of experience building and supporting solutions in Amazon Web Services (AWS)
- 5+ years of experience building and running a DevOps and/or SRE function
- Experience with implementation and operation of the chaos model at scale
- Strategic and program‑level implementation experience
- Demonstrable experience implementing new technology, tools, and platforms
- System administration and scripting experience
- Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents
- Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core, etc.)
- Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)
- Proficiency with defining, right‑sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability
- Experience with implementing and managing Error Budgets
- Proficiency with understanding and explaining incident situations and their recovery plans to prevent recurrence
- Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application‑specific logging
- Knowledge/experience with observability tools such as New Relic, SolarWinds DPA, Elastic Stack, Prometheus, Grafana, Splunk, and cloud‑native tools
- Knowledge/experience with cloud management tools such as Ansible, Terraform, Vault, and Vagrant
- Works independently, with guidance in only the most complex situations
- Makes sound decisions with limited facts or resources
- Balances strategic and pragmatic concerns when solving problems
- Adjusts…