Senior Site Reliability Engineer, New York, New York, USA (IT/Tech)

Location: New York

Job Title:
Senior Site Reliability Engineer

About the Role:

The Site Reliability Engineering team is part of the Digital Enterprise Technology Platform Engineering organization and is responsible for architecting, scaling, and maintaining the IT monitoring and observability ecosystem. You will ensure the reliability of Enterprise IT services by driving proactive telemetry strategies and deep system visibility.

We're looking for a self-starter who can take ownership of tasks, work under pressure, and balance multiple assignments simultaneously while maintaining a positive outlook. You'll lead the evolution of observability frameworks, contribute ideas and feedback on complex monitoring architectures, and provide expertise for IT projects and enhancements across various IT organizations.

Responsibilities:

Manage, assess, plan, and support core observability platform operations and strategy.
Lead process changes and implementations related to the monitoring and logging stack (e.g., Splunk, Grafana, New Relic).
Provide escalation support for configuration and platform issues, participating in on-call schedules to resolve major incidents using deep-dive observability data.
Collaborate with key stakeholders (Service Managers, Product Managers, Application Architects, Business Support, and Operations) to gather and develop complex monitoring and alerting requirements.
Develop AI, automation, and integrations to deliver predictive monitoring and automated anomaly detection.
Work with third-party vendors and partners to address platform-related enhancements and evaluate next-gen observability tooling.
Support and manage the introduction of new monitoring tools and orchestrate migrations to modern OpenTelemetry-based standards.
Periodically present reports on Service Level Indicators (SLIs), Service Level Objectives (SLOs), and correlation metrics to the Enterprise Operations team.
Work within an Agile Scrum methodology and provide technical mentorship on observability best practices to junior team members.
Create standard operating procedures for monitoring-as-code and share them with the team for effective execution.

Minimum Qualifications:

Bachelor's degree in Computer Science or related technical field, or equivalent experience in technical leadership
7-10 years of experience designing and implementing distributed systems to handle large-scale telemetry and log data
7-10 years of experience building and scaling high-volume observability pipelines
Proven mastery of full-stack observability suites (Splunk, ThousandEyes, or similar)
Direct experience implementing OpenTelemetry (OTel) standards
Strong background in "Monitoring as Code" using Terraform or similar automation tools
Demonstrable ability in Bash/PowerShell, Python, and JavaScript (Node.js), especially program comprehension
Understanding of REST-based API design principles and best practices
Experience with server administration (Linux and Windows)
Knowledge of monitoring tools like Zabbix, Splunk, Grafana, New Relic, or ThousandEyes
Experience with AWS public cloud and VMware vSphere
Knowledge of configuration management and orchestration tools like Puppet, Ansible, or Terraform
Experience with Docker and containerized applications
Strong troubleshooting and debugging skills (reading log files, analyzing memory leaks)
Strong analytical skills and ability to gather and synthesize data for review
Ability to problem-solve in a fast-paced environment and shift gears effectively
Subject matter expertise in at least one monitoring and telemetry product

Preferred Qualifications:

Experience with AI and machine learning applications in operations
Experience with predictive monitoring and auto-healing solutions
Master's degree in Computer Science or related field
Experience translating technical concepts into visual representations
