Lead Site Reliability Engineer
Job in Philadelphia, Philadelphia County, Pennsylvania, 19117, USA
Listed on 2026-05-13
Listing for: Judge Group, Inc.
Full Time position
Job specializations:
- IT/Tech
Cloud Computing: Infrastructure & Operations, Systems Engineer, SRE/Site Reliability, IT Project Manager
Job Description & How to Apply Below
Salary: TBD
Description:
We are seeking a Lead Site Reliability Engineer (SRE) who combines deep technical expertise with strong leadership and client-facing capabilities. This is a high-impact role responsible for ensuring the reliability, scalability, and performance of our cloud infrastructure and kiosk platform.
You will lead a team of engineers while remaining hands-on, owning uptime, SLAs, and incident management, and driving long-term improvements in system resilience and operational maturity. This role also requires working closely with Fortune 500 clients, translating complex technical concepts into clear, business-friendly insights.
What Makes This Role Unique
This is a rare opportunity for a hybrid leader who can:
- Operate as a hands-on SRE expert
- Lead and mentor a team of engineers
- Act as a client-facing technical advisor
- Drive both real-time operations and long-term reliability strategy
Reliability & Operations
- Own platform uptime, SLAs, and overall system reliability
- Lead incident response, root cause analysis, and postmortems
- Develop and maintain disaster recovery and business continuity plans
- Design, build, and optimize cloud infrastructure and Kubernetes environments
- Automate deployments and operational tasks using CI/CD and Infrastructure-as-Code (Terraform preferred)
- Improve system scalability, performance, and resilience
- Implement and enhance monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, New Relic)
- Establish operational standards, runbooks, and best practices
- Lead, mentor, and develop a team of ~6 engineers
- Partner with platform engineering, QA, and development teams to ensure operational readiness
- Serve as a technical point of contact for clients, clearly communicating system health, risks, and solutions
- 8+ years of experience in SRE, DevOps, or Platform Engineering
- 2+ years in a lead or managerial role
- Strong expertise in:
  - Cloud infrastructure (AWS, Azure, or Google Cloud Platform)
  - Kubernetes and containerized environments
  - CI/CD pipelines and release engineering
  - Infrastructure-as-Code (Terraform preferred)
- Proficiency in scripting/automation (Python, Bash, or Go)
- Deep understanding of observability, monitoring, and logging systems
- Experience with GitOps workflows (e.g., ArgoCD)
- Proven experience managing production systems with strict uptime requirements
- Client-facing experience in enterprise or SaaS environments (required)
- Experience communicating with non-technical stakeholders and Fortune 500 clients
- Background in high-availability systems and large-scale distributed environments
- A hands-on technical leader who can balance execution and strategy
- Strong communicator with executive presence
- Someone who thrives in high-ownership, fast-paced environments
- A mentor who can elevate team performance and operational excellence
Contact:
This job and many more are available through The Judge Group. Please apply with us today!