Site Reliability Engineer II
Listed on 2026-02-12
IT/Tech
SRE/Site Reliability, Cloud Computing, Systems Engineer, IT Support
Overview
Candescent is the leading cloud-based digital banking solutions provider for financial institutions. We are transforming digital banking with intelligent, cloud-powered solutions that connect account opening, digital banking, and branch experiences for financial institutions. Our advanced technology and developer tools enable seamless, differentiated customer journeys that elevate trust, service, and innovation. Success here requires flexibility in a fast-paced environment, a client-first mindset, and a commitment to delivering consistent, reliable results as part of a performance-driven, values-led team.
With team members around the world, Candescent is an equal opportunity employer.
Site Reliability Engineer II
Experience: 4-6 Years
Location: Bangalore (Ecospace)
Role overview
Candescent's Site Reliability Engineering (SRE) mission is to proactively ensure the reliability, availability, and performance of our Digital First banking applications. As a member of the SRE team, you will focus on building and operating highly reliable application platforms by applying SRE principles such as automation, observability, resilience, and continuous improvement.
You will partner closely with application and platform teams to define reliability standards, implement monitoring, alerting, and incident response practices, and embed scalability and performance considerations into application design and delivery. Through tooling, automation, and best practices, you will help development teams build and operate services that meet agreed reliability objectives.
As a senior engineer in the organization, you will also provide mentorship within the SRE team and across peer engineering teams, helping elevate operational maturity, drive adoption of SRE practices, and strengthen reliability culture across our core initiatives.
Responsibilities
- Support and operate production applications running on Kubernetes and AWS
- Troubleshoot application-level issues using logs, metrics, traces, and runtime signals
- Participate in incident response, root cause analysis, and post-incident reviews
- Work closely with development teams to understand application architecture, dependencies, and data flows
- Improve application observability by defining meaningful alerts, dashboards, and SLOs
- Automate repetitive operational tasks to reduce toil
- Support application deployments, rollbacks, and runtime configuration changes
- Identify reliability, performance, and scalability gaps in application behavior
- Drive continuous improvements in operational readiness, runbooks, and on-call practices
- Influence application teams to adopt shift-left reliability practices
Requirements
- Hands-on experience supporting Java applications in production
- Strong understanding of JVM fundamentals (heap/memory management, garbage collection, OOM issues, thread analysis)
- Proven experience with SRE practices, including:
  - Incident response and on-call support
  - Root cause analysis and postmortems
  - SLIs, SLOs, and reliability-driven operations
- Strong experience troubleshooting using application logs, metrics, and monitoring tools
- Experience operating Java applications on Kubernetes (EKS) from an application/runtime perspective
- Experience with deployment strategies (rolling, blue/green, canary)
- Ability to write automation and scripts (Python or any other language) to reduce operational toil
- Solid understanding of application architecture and service dependencies (databases, messaging systems, external APIs)
- Strong collaboration and communication skills; ability to work closely with development teams
- Demonstrates accountability and sound judgment when responding to high-pressure incidents
- Exposure to platform or infrastructure concepts supporting application workloads
- Experience with AWS services such as EKS, RDS/Aurora, S3, EFS, and CloudWatch
- CI/CD pipeline experience (GitHub Actions, Jenkins)
- Familiarity with GitOps practices
- Experience with cloud migrations or modernization efforts