Site Reliability Engineer
Listed on 2026-01-09
IT/Tech
Cloud Computing, IT Support, Systems Engineer, Cybersecurity
About the Role
Job Title: Site Reliability Engineer
Key Responsibilities
- Implement and manage full‑stack observability using Datadog across infrastructure, applications, and services.
- Instrument monitoring agents in on‑premises, cloud, and hybrid environments.
- Design and deploy monitoring solutions including dashboards, alerts, monitors, SLA/SLO definitions, and anomaly detection.
- Integrate Datadog with third‑party systems such as ServiceNow, SSO, and ITSM tools.
- Instrument applications and services using OpenTelemetry to collect logs, metrics, and traces.
- Build and maintain observability platforms providing deep system visibility.
- Develop dashboards and alerts using Prometheus, Grafana, Splunk, and ELK Stack.
- Automate monitoring configurations using Terraform, Ansible, and scripting.
- Integrate observability into CI/CD pipelines (e.g., Jenkins).
- Collaborate with Dev, SRE, and DevOps teams to align monitoring with business and operational goals.
- Support incident response, root cause analysis, and reliability improvements.
- Implement security and vulnerability management within observability platforms.
Skills & Qualifications
- Strong hands‑on experience with Datadog (Logs, Metrics, APM, Distributed Tracing).
- Hands‑on experience in cloud‑based observability solutions across AWS, Azure, and GCP.
- Strong understanding of observability concepts (Logs, Metrics, Tracing).
- Experience instrumenting systems using OpenTelemetry.
- Proficiency in Python and/or Go for scripting and automation.
- Hands‑on experience with Terraform and Ansible (IaC).
- Experience with Kubernetes and containerized environments.
- Knowledge of CI/CD pipelines and automation tools (e.g., Jenkins).
- Solid background in system operations and software engineering.
- Experience with security and vulnerability management in observability platforms.
- Experience with additional observability tools such as Prometheus, Grafana, ELK Stack, Splunk, New Relic, and AWS CloudWatch.
- Experience optimizing cloud agent instrumentation for performance and cost.
- Exposure to large‑scale, distributed, or high‑availability systems.
The salary for this position is between $120,000 and $130,000 annually. Factors that may affect pay within this range include geography/market, skills, education, experience, and other qualifications of the successful candidate.
Benefits
Medical insurance, dental insurance, vision insurance, 401(k) retirement plan, long‑term disability insurance, short‑term disability insurance, 5 personal days accrued each calendar year, 10‑15 days of paid vacation time, 6 paid holidays and 1 floating holiday per calendar year, Ascendion Learning Management System
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Information Technology
Industries: Technology, Information and Internet
Want to change the world? Let us know.
Tell us about your experiences, education, and ambitions. Bring your knowledge, unique viewpoint, and creativity to the table. Let’s talk!