Principal Site Reliability Engineer
Job in Santa Clara, Santa Clara County, California, 95054, USA
Listed on 2026-06-09
Listing for: Palo Alto Networks
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, IT Support
Job Description & How to Apply Below
At Palo Alto Networks®, we're united by a shared mission: to protect our digital way of life. We thrive at the intersection of innovation and impact, solving real-world problems with cutting-edge technology and bold thinking. Here, everyone has a voice, and every idea counts. If you're ready to do the most meaningful work of your career alongside people who are just as passionate as you are, you're in the right place.
**Who We Are**
To be the cybersecurity partner of choice, we must trailblaze the path and shape the future of our industry. This is something our employees work at each day and is defined by our values: Disruption, Collaboration, Execution, Integrity, and Inclusion. We weave AI into the fabric of everything we do and use it to augment the impact every individual can have. If you are passionate about solving real-world problems and ideating beside the best and the brightest, we invite you to join us!
We believe collaboration thrives in person. That's why most of our teams work from the office full time, with flexibility when it's needed. This model supports real-time problem-solving, stronger relationships, and the kind of precision that drives great outcomes.
**Job Summary**
The Cortex team builds and delivers the industry's most advanced SecOps platform, consisting of XDR, XSIAM, XSOAR, and XPANSE.
As a Principal Site Reliability Engineer within the Cortex DevOps team, you will serve as a technical leader responsible for driving the reliability, scalability, observability, and operational excellence strategy across the Cortex platform. You will partner closely with engineering, product, and infrastructure teams to influence architecture decisions, establish reliability standards, and build innovative solutions that improve service availability, performance, and operational efficiency at global scale.
This role requires deep expertise in cloud infrastructure, observability, distributed systems, automation, and incident management. You will help shape the future direction of our observability and reliability platforms while mentoring engineers and driving best practices across the organization.
**Key Responsibilities**
+ Define and drive reliability, observability, and operational excellence standards across Cortex services and infrastructure.
+ Design and evolve large-scale observability platforms using technologies such as Prometheus, Thanos, Grafana, OpenTelemetry, and cloud-native monitoring solutions.
+ Partner with engineering teams to ensure services are designed, instrumented, and operated with reliability and scalability in mind.
+ Drive improvements in monitoring, alerting, incident management, and service health to proactively identify and prevent customer-impacting issues.
+ Lead initiatives focused on automation, self-healing systems, operational efficiency, and reduction of operational toil.
+ Influence architectural decisions and technology adoption to improve platform reliability, performance, and cost efficiency.
+ Mentor engineers and provide technical leadership across multiple teams and organizations.
+ Stay current with emerging technologies and industry trends, evaluating and implementing solutions that advance Cortex's operational capabilities.
+ Provide leadership during major incidents and drive post-incident reviews focused on systemic improvements.
**Qualifications**
**Required Qualifications**
+ 10+ years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or related disciplines.
+ Deep expertise with Prometheus, Thanos, Grafana, OpenTelemetry, and modern observability platforms.
+ Strong understanding of SRE principles including SLIs, SLOs, error budgets, incident management, and operational excellence.
+ Expert knowledge of Google Cloud Platform (GCP), Amazon Web Services (AWS), or similar cloud platforms.
+ Expert-level experience with Kubernetes, Docker, and cloud-native architectures.
+ Strong software engineering and automation skills using Python, Linux, Terraform, Ansible, and GitOps practices.
+ Proven ability to influence technical direction and drive cross-functional initiatives across multiple engineering teams.
**Preferred Qualifications**
+ Experience building and operating observability platforms at large scale.
+ Experience implementing AI-driven operational tooling, automation, or AIOps solutions.
+ Strong communication and leadership skills with experience mentoring senior engineers and leading complex technical initiatives.
+ Ability to operate independently, influence stakeholders, and drive outcomes across organizational boundaries.
**Compensation Disclosure**
The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered…