Lead, Service Reliability Engineer R&D

Job in Raritan, Somerset County, New Jersey, 08869, USA
Listing for: Johnson & Johnson Innovative Medicine
Full Time position
Listed on 2026-05-24
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

Job Title

Lead, MedTech Technology Service Reliability Engineer, R&D

Location

Raritan, New Jersey, United States of America

Job Description Summary

The Service Reliability Engineer (SRE) designs, builds, and operates reliability practices and technical capabilities that ensure critical engineering and enterprise services are available, performant, secure, and resilient. This is a hands‑on, non‑manager role focused on improving service reliability through observability, incident response, automation, and engineering excellence. The SRE partners closely with Product Owners, development teams, infrastructure/platform engineering, Quality/Validation, Security, and Enterprise Architecture to define reliability targets, implement operational controls, and maintain documentation appropriate for regulated environments.

The SRE helps standardize operational patterns across environments (dev/test/prod) including monitoring baselines, access controls, runbooks, change management, and deployment readiness. Key outcomes include establishing and measuring Service Level Indicators/Objectives (SLIs/SLOs), improving alert quality and troubleshooting speed, reducing incident frequency and Mean Time to Recovery (MTTR), and enabling safe, repeatable releases through automation and operational readiness.

Major Duties & Responsibilities
  • Define, implement, and continuously improve reliability standards for production services, including SLIs/SLOs, error budgets, and operational readiness criteria.
  • Build and maintain observability capabilities (metrics, logs, traces, dashboards) and establish actionable alerts that reflect customer impact.
  • Participate in on‑call rotations, lead incident triage and restoration, and drive root‑cause analysis with corrective and preventive actions.
  • Engineer reliability improvements through automation (self‑healing, auto‑remediation, runbook automation) and eliminate toil through scripting and tooling.
  • Partner with engineering teams to design and validate resilient architectures (timeouts/retries, circuit breaking, queuing, graceful degradation) and to improve deployment safety.
  • Perform capacity planning and performance analysis; proactively identify bottlenecks and reliability risks, and validate scaling strategies.
  • Establish and maintain operational runbooks, playbooks, and escalation paths; conduct game days and resilience testing (failover/chaos exercises) as appropriate.
  • Improve change management by defining deployment/rollback standards, validating monitoring coverage, and supporting release readiness reviews across dev/test/prod.
  • Create and maintain operational documentation (service catalogs, SLIs/SLOs, runbooks, monitoring standards) and ensure knowledge transfer across teams.
  • Support validation and audit readiness by following SDLC/IT controls, producing required evidence (e.g., monitoring/test results), and supporting controlled releases in regulated environments.
  • Develop reliability reporting (availability, latency, error rates, MTTR, incident trends) and present insights and recommendations to stakeholders.
  • Apply security‑by‑design principles (identity/access, secrets management, vulnerability management, data protection) and ensure operational practices meet company standards.
  • Collaborate with internal teams and vendors as needed to implement reliability improvements, manage platform upgrades, and continuously improve maintainability and supportability.
Qualifications – Required
  • Bachelor’s degree in Computer Science, Engineering, or related discipline, or equivalent experience.
  • 5+ years of experience in SRE, DevOps, platform engineering, or software engineering with substantial production operations responsibilities.
  • Hands‑on experience with observability and incident management practices, including monitoring/alerting design, on‑call operations, and root‑cause analysis.
  • Experience with infrastructure‑as‑code and CI/CD (e.g., Terraform/CloudFormation, Git, Azure DevOps/Jenkins or similar) and automated testing/release practices.
  • Experience operating services in cloud‑hosted or hybrid enterprise environments (AWS and/or on‑prem), including networking fundamentals, secure configuration,…