Sr. Site Reliability Engineer Job, Philadelphia area, Pennsylvania, USA, IT/Tech

Senior Site Reliability Engineer

The Freedom Pay Commerce Platform is the technology of choice for many of the largest companies across the globe in retail, hospitality, lodging, gaming, sports and entertainment, food service, education, healthcare and financial services. Freedom Pay's technology has been purpose-built to deliver rock-solid performance in the highly complex environment of global commerce. The company maintains a world-class security environment and was the first in North America to earn the coveted validation by the PCI Security Standards Council against the Point-to-Point Encryption with EMV standard.

Freedom Pay's robust solutions across payments, security, identity and data analytics are available in-store, online and on mobile, and are supported by rapid API adoption. The award-winning Freedom Pay Commerce Platform operates on a single, unified technology stack across multiple continents, allowing enterprises to deliver a consistent, repeatable experience on a global scale. Freedom Pay is a fast-paced, high-growth company with a great culture, competitive benefits and compensation, and a business-casual atmosphere.

Freedom Pay is seeking an experienced Senior Site Reliability Engineer to help ensure the highest possible availability and resiliency of a rapidly growing global payment platform. This full-time salaried position builds on a strong foundation of observability, incident response, and support experience across the development lifecycle — and pushes it forward with AI-driven operations and automation at its core. The right candidate finds real satisfaction in eliminating manual toil, treats every recurring task as an automation opportunity, and is eager to apply modern AI tooling to detect, diagnose, and resolve issues faster than ever before.

You'll join a team of SREs who work closely with other teams of world-class engineers to tenaciously and creatively solve problems and reduce manual toil wherever possible. We expect AI and automation to be a force multiplier in everything you do — from accelerating root-cause analysis and enriching alerts, to generating runbooks and codifying remediation so that the platform increasingly heals itself.

Successful candidates are heavily results-driven, bring well-established expertise across both traditional and bleeding-edge technology, and have a strong desire to continuously grow and improve themselves and our platform. This is a global operation spanning multiple regions and time zones, and the role demands the flexibility and commitment that a 24/7 payment platform requires.

This position participates in an engineering on-call rotation and provides after-hours support for production issue escalations on a rotational basis.

This position is based in the Philadelphia area with a hybrid schedule. Remote arrangements may be considered for exceptional candidates, with occasional travel to Philadelphia required.

Primary Responsibilities:

Build and maintain a comprehensive understanding of the platform and custom application stack.
Implement, maintain, and continuously improve observability strategies and metrics that ensure complete system health for numerous complex products throughout all stages of the development lifecycle, up to and including production.
Continuously identify automation opportunities and follow through to successful implementation, applying AI-assisted tooling to accelerate development and reduce manual effort.
Design, build, and maintain automated remediation and self-healing workflows that detect, triage, and resolve common failure modes with minimal human intervention.
Leverage AI/ML-driven observability — anomaly detection, alert correlation, and intelligent noise reduction — to surface issues earlier and shorten time to detection.
Use AI-assisted analysis to accelerate root-cause investigation, enrich incident context, and generate first-draft postmortems and runbooks for human review.
Handle escalations and collaborate effectively with other team members to quickly determine the root cause of any type of service degradation.
Implement, maintain, and continuously improve incident response procedures and other…