Job Summary:
We’re looking for a Site Reliability Engineer (SRE) / Application Support Engineer with 2-5 years of experience and strong technical and analytical skills to ensure the reliability, scalability, and performance of our core applications.
This role focuses on improving the stability and efficiency of distributed systems built on Java and a microservices architecture, driving operational excellence through monitoring, automation, and incident management.
You’ll be part of the team that keeps our business-critical systems healthy — investigating production issues, implementing preventive measures, and collaborating with engineering teams to improve observability and resiliency.
Key Responsibilities:
Application Reliability & Performance
Monitor and maintain the health, performance, and reliability of production applications.
Define, measure, and track SLIs/SLOs for key services, driving improvements proactively.
Identify performance bottlenecks, memory leaks, and slow transactions in Java-based microservices.
Partner with development teams to design and deploy resilient, fault-tolerant systems.
Mentor developers and operations engineers on observability and debugging techniques.
Incident Management & Troubleshooting
Actively participate in incident response: triage application issues and restore services quickly.
Perform deep root-cause analysis for recurring incidents and ensure permanent fixes are implemented.
Own the incident lifecycle — from detection to resolution and post-incident review.
Ensure observability tools and alert thresholds are tuned to reduce false positives and improve signal quality.
Monitoring & Automation
Enhance visibility across systems through better metrics, logs, and traces using Prometheus, Grafana, and Loki (or similar).
Automate repetitive tasks — deployments, rollbacks, scaling, and diagnostics.
Build or improve runbooks and self-healing mechanisms to reduce operational toil.
Integrate AIOps capabilities for smarter alert correlation, anomaly detection, and incident prediction.
Operational Ownership
Ensure production systems meet availability and performance targets.
Track open issues, follow up on root cause actions, and drive closure with responsible teams.
Collaborate with developers, infrastructure, and QA to maintain a consistent and stable release cycle.
Contribute to continuous improvement of deployment, monitoring, and rollback processes.
Collaboration & Communication
Work closely with product and platform engineering to integrate reliability into system design.
Communicate incident status, RCA findings, and reliability metrics to stakeholders.
Foster a reliability-first culture and advocate for operational excellence across teams.
Required Skills:
2–6 years of experience in Site Reliability Engineering or Application Operations.
Solid understanding of Java, Spring Boot, and microservices architecture.
Proficiency in monitoring and observability tools (Prometheus, Grafana, Loki, New Relic, or equivalent).
Familiarity with Kubernetes, containers, and CI/CD pipelines.
Familiarity with incident management, RCA, and performance debugging.
Experience with cloud platforms (AWS, Azure, or GCP).
Strong scripting skills (Bash, Python, or Go) for automation and diagnostics.
Good communication and stakeholder collaboration skills.