Senior Engineer - Site Reliability T500-27011 Job, Hyderabad area, Telangana, India, IT/Tech

Position: Senior Engineer - Site Reliability [T500-27011]
About T-Mobile:

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions:

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

This role ensures the reliability and resilience of digital infrastructure to support efficient software development and deployment. It involves automating processes and reducing manual effort to prevent operational incidents and improve system performance. The role requires expertise in programming, scripting, incident response management, and various technical tools to maintain system robustness. Success is measured by system stability, incident reduction, and continuous improvement in operational efficiency.

The work directly impacts organizational stability and customer experience by maintaining high-performing and reliable systems.

The Sr Engineer, Site Reliability is the core operations engineer, capable of resolving complex incidents, improving automation, and mentoring Engineer(s). They bridge operations and engineering by identifying recurring issues and creating scalable fixes.

What You’ll Do:

- Lead resolution of high-severity/complex incidents across hybrid infrastructure.
- Architect and implement automation frameworks, self-healing workflows, and AI-driven ops.
- Define SRE best practices, reliability SLIs/SLOs/SLAs, and operational standards.
- Partner with application and platform engineering teams to improve resilience.
- Drive observability maturity: predictive monitoring, anomaly detection, automated RCA.
- Own continuous improvement of the runbooks and automation pipelines used by Engineers and Sr Engineers.
- Provide technical leadership, mentor junior SREs, and conduct training.
- Identify new technologies, tools, and processes that elevate operational excellence.

What You’ll Bring:

- 10+ years in SRE/DevOps/Systems/Platform Engineering as a Principal or Staff engineer.
- Deep expertise in Kubernetes, distributed systems, and multi-cloud infrastructure.
- Strong knowledge of security, WAFs, and networking at scale.
- Advanced automation and programming (Python, Go, Terraform, Ansible).
- Experience applying AI/ML to operations (AIOps platforms, anomaly detection, predictive scaling).
- Strong incident command and leadership skills during outages.
- Proven track record of driving automation-first operations transformations.

Must-Have Skills:

Incident Command & Complex Troubleshooting:

- Expectation:
Take leadership during high-severity outages, orchestrating technical response across teams.
- Example:
Lead a Sev-1 bridge call where multiple microservices are failing due to cascading Kubernetes issues; coordinate DB, infra, network, security, and app teams to isolate the problem.

Deep Kubernetes & Distributed Systems Expertise:

- Expectation:
Design, troubleshoot, and optimize complex Kubernetes clusters and multi-region deployments.
- Example:
Diagnose why inter-cluster communication in a service mesh is causing intermittent API failures and propose architectural fixes.

Automation Framework Design (Infra & Ops):

- Expectation:
Architect automation platforms to reduce manual toil, enable self-service, and support auto-remediation.
- Example:
Build an Ansible/Terraform-based automation pipeline that provisions, configures, and tests new app environments with zero manual steps.

Observability Strategy & Advanced Monitoring:

- Expectation:
Define enterprise-wide observability standards (SLIs/SLOs/SLAs) and implement anomaly detection and predictive monitoring.
- Example:
Roll out a metrics-based SLO framework for all API services with automated burn-rate alerts in Prometheus.
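
For candidates less familiar with the term, the idea behind a burn-rate alert can be sketched in a few lines of Python. This is only an illustration and not T-Mobile's implementation: the `WindowCounts` type, the function names, and the 14.4 threshold (a commonly cited value for a one-hour window against a 30-day SLO) are all assumptions made for the example. In practice the request counts would come from Prometheus queries rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one lookback window (hypothetical input shape)."""
    total: int
    errors: int


def burn_rate(window: WindowCounts, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    if window.total == 0:
        return 0.0  # no traffic means no budget consumed
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (window.errors / window.total) / error_budget


def should_page(long_window: WindowCounts, short_window: WindowCounts,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    Requiring the short window as well keeps the alert from firing
    long after a brief error spike has already ended.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


if __name__ == "__main__":
    one_hour = WindowCounts(total=100_000, errors=2_000)
    five_min = WindowCounts(total=10_000, errors=200)
    print(should_page(one_hour, five_min, slo_target=0.999))
```

Checking a long and a short window together is one common design choice: the long window shows that a meaningful share of the budget is gone, and the short window confirms the problem is still happening.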

Database & Application Performance Engineering:

- Expectation:
Tune databases, caching layers, and app performance to handle scale.
- Example:
Identify DB query patterns that degrade API performance and recommend schema/index optimizations.

Cross-Domain SME Knowledge (Networking, Storage, APIs):

- Expectation:
Act as a go-to expert across infrastructure layers.