Lead, Service Reliability Engineer R&D Job, Raritan area, New Jersey, USA, IT/Tech

Job Title

Lead, Med Tech Technology Service Reliability Engineer, R&D

Location

Raritan, New Jersey, United States of America

Job Description Summary

The Service Reliability Engineer (SRE) designs, builds, and operates reliability practices and technical capabilities that ensure critical engineering and enterprise services are available, performant, secure, and resilient. This is a hands‑on, non‑manager role focused on improving service reliability through observability, incident response, automation, and engineering excellence. The SRE partners closely with Product Owners, development teams, infrastructure/platform engineering, Quality/Validation, Security, and Enterprise Architecture to define reliability targets, implement operational controls, and maintain documentation appropriate for regulated environments.

The SRE helps standardize operational patterns across environments (dev/test/prod), including monitoring baselines, access controls, runbooks, change management, and deployment readiness. Key outcomes include establishing and measuring Service Level Indicators/Objectives (SLIs/SLOs), improving alert quality and troubleshooting speed, reducing incident frequency and Mean Time to Recovery (MTTR), and enabling safe, repeatable releases through automation and operational readiness.

Major Duties & Responsibilities

Define, implement, and continuously improve reliability standards for production services, including SLIs/SLOs, error budgets, and operational readiness criteria.
Build and maintain observability capabilities (metrics, logs, traces, dashboards) and establish actionable alerts that reflect customer impact.
Participate in on‑call rotations, lead incident triage and restoration, and drive root‑cause analysis with corrective and preventive actions.
Engineer reliability improvements through automation (self‑healing, auto‑remediation, runbook automation) and eliminate toil through scripting and tooling.
Partner with engineering teams to design and validate resilient architectures (timeouts/retries, circuit breaking, queuing, graceful degradation) and to improve deployment safety.
Perform capacity planning and performance analysis; proactively identify bottlenecks and reliability risks, and validate scaling strategies.
Establish and maintain operational runbooks, playbooks, and escalation paths; conduct game days and resilience testing (failover/chaos exercises) as appropriate.
Improve change management by defining deployment/rollback standards, validating monitoring coverage, and supporting release readiness reviews across dev/test/prod.
Create and maintain operational documentation (service catalogs, SLIs/SLOs, runbooks, monitoring standards) and ensure knowledge transfer across teams.
Support validation and audit readiness by following SDLC/IT controls, producing required evidence (e.g., monitoring/test results), and supporting controlled releases in regulated environments.
Develop reliability reporting (availability, latency, error rates, MTTR, incident trends) and present insights and recommendations to stakeholders.
Apply security‑by‑design principles (identity/access, secrets management, vulnerability management, data protection) and ensure operational practices meet company standards.
Collaborate with internal teams and vendors as needed to implement reliability improvements, manage platform upgrades, and continuously improve maintainability and supportability.

Qualifications – Required

Bachelor’s degree in Computer Science, Engineering, or related discipline, or equivalent experience.
5+ years of experience in SRE, DevOps, platform engineering, or software engineering with substantial production operations responsibilities.
Hands‑on experience with observability and incident management practices, including monitoring/alerting design, on‑call operations, and root‑cause analysis.
Experience with infrastructure‑as‑code and CI/CD (e.g., Terraform/CloudFormation, Git, Azure DevOps/Jenkins or similar) and automated testing/release practices.
Experience operating services in cloud‑hosted or hybrid enterprise environments (AWS and/or on‑prem), including networking fundamentals, secure configuration,…
