Site Reliability Engineer Job, Chennai area, Tamil Nadu, India, IT/Tech

Staff Site Reliability Engineer

Who we are:

Arcadia is the AI-powered energy intelligence platform for businesses. We replace fragmented tools and manual workflows with one platform to pay utility bills, buy energy, and advance sustainability — across every location, at enterprise scale.
Trusted by Fortune 2000 companies, Arcadia combines unified data, AI-powered analytics, and expert advisory to help enterprise teams save money, mitigate risk, and cut carbon.
We deliver this through three comprehensive solutions:
Utility Bill Management: Automating the entire utility bill lifecycle, from data capture and validation to payment processing and auditing.
Energy Procurement Advisory: Bringing together comprehensive data, AI-powered analytics, market expertise, and a strong partner network to make sophisticated procurement options accessible to all.
Sustainability Reporting: Verified emissions data with seamless integration into leading sustainability platforms.
Tackling the world's most complex energy challenges requires diverse thinking. We're building teams of people from different backgrounds, industries, and disciplines — united by a belief that energy management should be simple, intelligent, and a genuine driver of business value.

What we're looking for:

We are seeking a Staff Site Reliability Engineer (L4) to join our SRE/Platform Engineering team in India. This is a senior technical leadership role — not people management, but engineering leadership through execution, mentorship, and architectural ownership.
Our India SRE team is growing, and this role is central to that growth. As we scale, we need a technical anchor in the India timezone who can independently own multi-week SRE projects from problem statement to production, make sound architectural decisions under ambiguity, and elevate the team around them. You will be the person engineers lean on for design reviews, debugging escalations, and 'how should we approach this' conversations.

You'll bring the depth and experience to drive execution autonomously in the India timezone while collaborating closely with US-based SRE leadership on roadmap priorities, incident response, and platform strategy.
This is a role for someone who doesn't wait for direction — you identify reliability gaps, propose solutions, build consensus, and ship.
Our infrastructure is primarily AWS-based, managed by Terraform and CloudFormation, and deployed using CI/CD best practices. In your application, please include a link to GitHub or another place where your code is published, though we understand that not everyone has public code online.

What you'll do:

Own and deliver SRE projects end-to-end: from scoping and design through implementation, testing, rollout, and documentation
Serve as a technical anchor for the India SRE team: conduct design reviews, pair on complex debugging, and mentor engineers to develop the judgment to work through ambiguous problems independently
Design and implement infrastructure solutions across AWS (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, S3, CloudFront, Lambda, SQS) using Terraform and CloudFormation, with an emphasis on making the right tradeoffs between speed, reliability, and cost
Lead Kubernetes operations, including cluster upgrades, capacity planning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments, and build the runbooks and automation so these become repeatable rather than one-off heroics
Evolve CI/CD pipelines across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD, with an emphasis on reducing manual deployment steps and improving rollback safety
Drive observability stack enhancements: deliver the infrastructure and architectural direction necessary for engineering teams to leverage Prometheus, Grafana, and CloudWatch effectively
Identify and execute FinOps initiatives: find zombie resources, right-size instances, enforce tagging standards, and present cost-reduction recommendations with data to back them up
Manage database reliability across MySQL and PostgreSQL, including backup validation, performance tuning,…