Senior Site Reliability Engineer

Job in New York, New York County, New York, 10261, USA
Listing for: Zeta Global
Full Time position
Listed on 2026-02-16
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below
Location: New York

Overview

Zeta Global (NYSE: ZETA) is the AI-Powered Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to make it easier for marketers to acquire, grow, and retain customers more efficiently. Through the Zeta Marketing Platform (ZMP), our vision is to make sophisticated marketing simple by unifying identity, intelligence, and omnichannel activation into a single platform – powered by one of the industry’s largest proprietary databases and AI.

Our enterprise customers across multiple verticals are empowered to personalize experiences with consumers at an individual level across every channel, delivering better results for marketing programs. Zeta was founded in 2007 by David A. Steinberg and John Sculley and is headquartered in New York City with offices around the world. To learn more, go to

The Role

We’re looking for an experienced Senior Site Reliability Engineer (SRE) who can write production-grade code, has mastery of SLIs, SLOs, and error budgets, and is passionate about building scalable observability systems.

If You
  • Can code confidently in Python or Golang and solve real-world problems through automation (not only scripting).
  • Have hands-on experience implementing SLIs, SLOs, and distributed tracing in production.
  • Understand Kubernetes, Terraform, and Infrastructure as Code tools.
  • Have hands-on experience with Chaos Engineering and anomaly detection.
  • Are excited about working with high-throughput, distributed systems processing millions of transactions daily.
Key Responsibilities
  • Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives.
  • Develop production-grade software to enhance system reliability and reduce manual toil through automation.
  • Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights.
  • Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence.
  • Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress.
  • Design and participate in Chaos Engineering exercises.
  • Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green).
  • Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management.
  • Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance.
Your Experience
Strong Coding Background
  • 4+ years of experience as an SRE or in a similar role with hands-on coding.
  • 3+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code.
SRE Expertise
  • Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application.
  • Hands-on experience conducting postmortems and implementing observability at scale.
  • Hands-on experience conducting chaos engineering exercises.
Observability Skills
  • Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb.
  • Experience with distributed tracing and handling high-cardinality metrics in production environments.
Infrastructure Knowledge
  • 3+ years of experience with AWS and proficiency in Kubernetes, Terraform, and Infrastructure as Code (IaC) tools.
  • Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes).
Monitoring And Automation
  • Hands-on experience with CI/CD platforms (GitOps, Jenkins, ArgoCD) and building automated pipelines.
  • Familiarity with tools and frameworks for incident management and operational automation.
Additional Skills
  • Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries).
  • Strong analytical skills for…
Position Requirements
10+ Years work experience