Senior Site Reliability Engineer

Job in New York, New York County, New York, 10261, USA
Listing for: DoubleVerify
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability, Systems Engineer, IT Support
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description
Location: New York

Requirements

  • 4+ years in Site Reliability Engineering, DevOps, or related operational roles with proven experience in Linux/Unix systems administration
  • Proficiency in scripting and programming languages such as Python, Bash, or Go for automation and tool development
  • Strong experience with cloud infrastructure and services across GCP, AWS, and OCI, as well as container orchestration tools like Kubernetes
  • Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, and Nagios
  • Hands‑on experience with Infrastructure‑as‑Code tools like Terraform, Ansible, or Helm
  • Proven ability to develop and track SLIs, SLOs, and SLAs to drive reliability improvements
  • Deep understanding of networking, DNS, load balancing, and CDN technologies
  • Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
  • Knowledge of CI/CD pipelines, GitLab, and deployment automation
  • Experience with workflow automation platforms is a strong plus
  • Exceptional communication skills with the ability to collaborate across teams and explain technical concepts clearly
  • Proactive problem‑solving approach with a focus on automation and continuous improvement
  • Ownership mentality — you take full responsibility for complex challenges and reliably deliver outcomes
  • Trailblazing spirit — innovative use of AI, automation, and new technologies to solve problems and drive improvements
  • Passion for mentorship and knowledge sharing, elevating the capabilities of the entire team
  • (Desirable) Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
  • (Desirable) Industry certifications such as AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA), or Terraform/Grafana certifications
  • (Desirable) Experience with AI-assisted development using tools like ChatGPT, Cursor, Glean, or Copilot
  • (Desirable) Familiarity with security best practices in cloud and containerized environments
  • If you think you have what it takes but you’re not sure that you check every box, apply anyway!

What the job involves

  • Improve and maintain the reliability, scalability, and performance of our digital media measurement platforms
  • Implement observability best practices, including metrics collection, dashboarding, and alerting strategies that support proactive reliability improvements
  • Reduce MTTR for critical incidents through automation, improved observability, and proactive monitoring
  • Respond to incidents and drive them to resolution, managing Sev1/Sev2 situations
  • Monitor and maintain high-availability infrastructure and services across GCP, AWS, OCI, and on‑premises environments
  • Lead technical projects from planning through deployment, ensuring proper stakeholder communication and team enablement
  • Build and deploy automations to eliminate operational toil and improve efficiency across deployment workflows, validation scripts, and self‑service capabilities
  • Leverage AI-assisted development tools to accelerate automation development and problem resolution
  • Build custom integrations and Model Context Protocol (MCP) servers for monitoring platforms to enable programmatic access and AI-driven analysis
  • Implement Infrastructure‑as‑Code using Terraform, Helm charts, Python and other scripts, and configuration management tools to ensure repeatable, version‑controlled infrastructure deployments
  • Develop production automations for routine operational tasks, reducing manual intervention and accelerating task completion
  • Create and maintain documentation, runbooks, and SOPs in Confluence to ensure consistent incident response across the team
  • Participate in on‑call rotations and post‑incident reviews to minimize downtime and prevent recurrence
Position Requirements
10+ years of work experience