Site Reliability Engineer Job, Hyderabad area, Telangana, India, IT/Tech

SRE Observability Developer

Location: Hyderabad | Exp: 5–10 Years | Focus: Observability-as-Code & Automation

Role Overview
We are hiring a Site Reliability Engineer to mature the observability and root cause analysis (RCA) capabilities of our high-scale UPI payment platforms. This is a hands-on, code-driven role focused on building reliable telemetry pipelines, transaction correlation, and automated alerting frameworks. You will treat monitoring configurations as code to ensure consistent, scalable operational intelligence.

Key Responsibilities
Telemetry Standardization: Build and standardize metrics, logs, and traces across app, middleware, and infra layers. Implement custom tags/attributes for unified drill-down dashboards.
Transaction Correlation: Enable correlation for asynchronous UPI flows to provide end-to-end visibility across distributed services.
SLO & Alert Engineering: Define Golden Signals and SLIs for critical journeys (P2P, P2M). Implement Alert-as-Code using config-based anomaly detection and noise-reduction logic.
Observability-as-Code: Automate the provisioning of Grafana dashboards, alert rules, and collector configurations (OTel/Fluentd) using version-controlled scripts.
RCA & Intelligence: Build RCA-focused views for Redis, Kafka, YugabyteDB, and Nginx. Use synthetic monitoring and black-box exporters to gain visibility into partially controlled systems.
Operational Integration: Convert incident learnings into automated telemetry patterns. Embed observability validation into deployment and release workflows.

Must-Have Skills
1. Observability Stack
Expertise: Prometheus/VictoriaMetrics, VictoriaLogs/VictoriaTraces, OpenTelemetry (OTel), and Fluentd.
Tooling: Advanced Grafana, Alertmanager, and various infrastructure exporters.
Development: Ability to develop custom exporters using OpenTelemetry SDKs for unique business/transaction metrics.
2. Systems & Middleware
Knowledge: Deep understanding of Redis, Kafka, Nginx, and YugabyteDB (or similar distributed DBs).
App Tier: Proficiency with JVM/Spring Boot Actuator metrics and asynchronous request/response patterns.
Environment: Experience with high-scale, low-latency platforms; UPI/Payments domain is highly preferred.
3. Scripting & Automation
Core Skills: Strong Python and Shell/Bash for automating telemetry validation and collector lifecycle management.
Mindset: Ability to treat all monitoring assets (dashboards, rules, configs) as code artifacts.

What We’re Looking For
An engineer who sees a dashboard as a product of code, not just a UI task.
Strong debugging skills across complex, on-prem distributed systems.
The ability to bridge the gap between what happened and where the code failed through advanced correlation.