This is for you if you've:
Built and operated large-scale Java/JVM services
Carried on-call and handled real production incidents
Debugged JVM, GC, latency, and concurrency issues under pressure
Implemented resilience patterns (circuit breakers, timeouts, graceful degradation)
What you'll do:
Own availability, latency, and reliability of critical services
Improve systems through code-level reliability, not just infrastructure
Define SLIs/SLOs, lead incident reviews, reduce toil
Partner with product teams to design for failure
Reliability mindset (what differentiates this role)
Experience implementing or driving:
Circuit breakers, bulkheads, rate limiting, backpressure
Graceful degradation and fallback strategies
Familiarity with observability concepts:
Metrics (e.g., latency percentiles, saturation)
Distributed tracing
Health checks & readiness probes
Nice to have (not mandatory):
Exposure to SRE / Platform / Production Engineering
Kubernetes, observability, or chaos engineering experience
Software-first SRE role | Real ownership | Strong growth into reliability leadership
Position Requirements
10+ years of work experience