Production Engineer; Cloud Platform & Reliability – Machine Identity Security - hybrid Job, Santa Clara area, California, USA, Software

Position: Staff Production Engineer (Cloud Platform & Reliability – Machine Identity Security) - hybrid

Job Summary

The Production Engineering team is responsible for building, scaling, and operating the cloud platform for CyberArk Machine Identity Management products. Our solutions are trusted by the world's largest organizations to protect and manage TLS machine identities, SSH machine identities, and code signing identities.

As a Senior Staff Production Engineer, Platform, on the Production Engineering team, you will design and build the foundational cloud platform capabilities that power our products. Operating at the intersection of software engineering, distributed systems, cloud infrastructure, and production reliability, you will develop platform services, automation frameworks, internal tooling, and cloud-native architectures that improve developer productivity, operational efficiency, and system resiliency.

This role is ideal for engineers who view infrastructure as software, enjoy solving complex distributed systems challenges, and are passionate about building platforms that enable engineering teams to move faster and operate reliably at scale.

Platform Engineering & Software Development

Design, build, and maintain cloud platform capabilities, automation services, and engineering tooling that improve scalability, reliability, and developer productivity.
Develop software solutions and platform services using Python, Go, or similar languages to automate operational workflows and eliminate manual processes.
Build reusable platform frameworks, Infrastructure as Code modules, and self‑service capabilities that accelerate engineering teams.
Design and evolve Internal Developer Platform (IDP) capabilities that improve onboarding, deployment, and operational consistency.
Drive engineering standards and platform best practices across cloud‑native environments.

Distributed Systems & Cloud Infrastructure

Architect and evolve highly available cloud platforms supporting large‑scale distributed systems.
Design scalable Kubernetes‑based platform capabilities and cloud‑native infrastructure solutions.
Partner closely with software engineering teams to improve service architecture, resiliency, and operational readiness.
Design and implement observability, monitoring, and telemetry solutions across platform services and distributed environments.
Lead scalability, performance, and reliability initiatives across cloud infrastructure and platform components.

Production Reliability & Operational Excellence

Lead incident response, root cause analysis, and long‑term remediation efforts.
Drive improvements in availability, resiliency, and operational maturity across production systems.
Establish SLI/SLO‑driven reliability practices and engineering accountability.
Identify and eliminate operational toil through software engineering, automation, and platform improvements.
Mentor engineers and contribute to raising the technical bar across platform and production engineering disciplines.

Qualifications

5+ years of experience in Platform Engineering, DevOps, Site Reliability Engineering (SRE), or Software Engineering focused on cloud platforms and infrastructure.
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP.
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE).
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi).
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions).
Strong programming skills in Python, Go, or similar languages, with experience building automation, platform tooling, internal developer services, or cloud‑native applications.
Experience designing, building, or supporting platform capabilities that improve developer productivity, operational efficiency, and reliability.
Experience operating in high‑scale, 24/7 production environments with ownership of incident response and reliability.
Solid understanding of distributed systems, cloud‑native architectures, and modern platform engineering practices.
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS).
Strong problem‑solving skills and ability to work across teams.

Nice to…