Site Reliability Operations; SRO Engineer III
Listed on 2026-02-15
IT/Tech
IT Support, Systems Engineer, Cloud Computing
PENNYMAC
Pennymac (NYSE: PFSI) is a specialty financial services firm with a comprehensive mortgage platform and integrated business focused on the production and servicing of U.S. mortgage loans and the management of investments related to the U.S. mortgage market.
At Pennymac, our people are the foundation of our success and at the heart of our dynamic work culture. Together, we work towards a unified goal of helping millions of Americans achieve aspirations of home ownership through the complete mortgage journey.
A Typical Day
As a member of the Site Reliability Operations (SRO) team, you will help provide 24/7 monitoring and support of the company’s IT Infrastructure. Ideal candidates should have experience in Windows and Linux administration, in addition to experience working in AWS, as Pennymac is now almost completely migrated into the AWS cloud. Individuals in this role should be comfortable working in a fast-paced environment.
Multitasking, in addition to communicating quickly and accurately, is critical to the success of anyone in this role.
The Engineer III, Site Reliability Operations will:
- Monitoring – Oversee 24/7 health monitoring of the company’s IT Infrastructure using tools such as AWS CloudWatch and New Relic. Drive observability maturity across the organization by identifying coverage gaps and implementing targeted improvements.
- Alert Management – Own the ongoing refinement of operational alerts. Implement advanced alerting rules and thresholds to proactively identify issues, reduce noise, and ensure every alert drives action.
- Observability Gap Analysis – Partner with Incident Management to identify monitoring and alerting gaps discovered during incident triage; prioritize and implement enhancements to prevent recurrence.
- App Team Engagement – Serve as an observability resource to application teams, assessing current instrumentation and providing actionable recommendations to improve monitoring maturity.
- Alert Quality Ownership – Lead initiatives to reduce alert noise, improve signal-to-noise ratio, and ensure every alert is actionable with clear runbook linkage.
- Operational Dashboard Development – Design and maintain operationally focused dashboards in New Relic that support 24/7 triage, SLA tracking, and real-time incident response.
- Incident Management – Serve as an escalation point for complex incidents. Collaborate closely with the Incident Management team, Application Developers, Internal Support Teams, and 3rd Party Vendors to ensure timely and accurate resolution of service disruptions.
- Advanced Systems Administration – Perform and troubleshoot a wide range of administrative tasks across Windows and Linux environments. Assist in optimizing system performance, conducting root‑cause analyses, and implementing long‑term fixes.
- Virtual Server and Desktop Management – Handle more complex tasks associated with maintaining and troubleshooting the company’s virtual infrastructure. Provide guidance to junior engineers for routine issues.
- Technical Troubleshooting and Investigation – Tackle advanced technical issues that are escalated from Engineer I/II. Conduct deep dives into infrastructure and application logs to pinpoint underlying problems.
- Internal and External Escalation – Act as a liaison between multiple internal teams and external vendors for high‑priority incidents. Ensure swift coordination and minimize downtime.
- Change Management – Strictly follow and help refine the company’s established Change Management processes. Provide risk assessments and validation for proposed changes before approval.
- Communication – Monitor and respond to incoming Calls, Chats, and Emails directed to the SRO team. Offer structured feedback to stakeholders when complex issues are underway.
- Ticket Queue Management – Lead by example in managing multiple ticket queues (ServiceNow, Jira, etc.). Take ownership of priority tickets and oversee distribution among the team.
- Documentation – Maintain and expand the SRO team’s knowledge base. Author new Standard Operating Procedures (SOPs) that incorporate best practices gained from resolving advanced incidents.
- Deployments – Coordinate and execute application and website code…