Principal Engineer, Site Reliability T500-22421 Job, Hyderabad area, Telangana, India, IT/Tech

Position: Principal Engineer, Site Reliability [T500-22421]

About T-Mobile:

T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.

About TMUS Global Solutions:

TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.

TMUS India Private Limited operates as TMUS Global Solutions.

Job Overview:

At T-Mobile, we don’t just build technology — we empower people. We believe in investing in YOU — your growth, your impact, and your future. We’re unstoppable when individuals like you come together to solve bold challenges, inspire innovation, and build platforms that serve millions.

As a Principal Site Reliability Engineer, you’ll join a world-class engineering team focused on building and scaling intelligent infrastructure for LLM-based applications, AI services, and enterprise-scale backend systems. You’ll contribute to the design and implementation of observability, automation, and incident response strategies that ensure our platforms are high-performing, reliable, and cost-effective. You’ll play a key role in driving operational excellence, supporting platform scalability, and collaborating across engineering and architecture teams.

This role provides growth opportunities to influence large-scale architecture and AI/ML reliability.

Key Responsibilities:

- Design, develop and maintain observability, monitoring, and alerting systems for AI platforms and mission-critical backend services.
- Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
- Define and maintain SLOs, SLIs, and real-time health indicators across platform services and APIs.
- Participate in on-call rotations and lead the resolution of high-impact incidents, including root cause analysis and postmortem reporting.
- Collaborate with platform engineering teams to enforce governance, compliance, and security standards in production environments.
- Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g., GitLab).
- Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ, databases, and distributed APIs.
- Support capacity planning, cost analysis, and system tuning to improve platform performance.
- Advocate for automation-first operations, reducing manual toil through scripting and reliability tooling.
- Create and maintain documentation, runbooks, and knowledge-sharing resources across SRE and engineering teams.
- Mentor junior engineers and foster a culture of technical rigor and continuous improvement.

Qualifications:

- Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred).
- 10+ years of experience in SRE, DevOps, or operations engineering in cloud-based environments, with 15+ years of overall experience in technology.
- Hands-on experience with monitoring, alerting, and incident response in distributed systems.
- Strong coding and scripting skills in Python, Java, or shell scripting languages such as Bash or PowerShell.
- Solid understanding of database principles and experience with distributed storage solutions such as Oracle, Cassandra, Solr, and Kafka.
- Proficiency in CI/CD pipelines and GitLab workflows.
- Strong working knowledge of SQL and NoSQL databases, including Oracle and Cassandra.

- Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and troubleshooting large-scale environments.
- Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
- Expertise in observability tools such as Splunk, Grafana, and Prometheus.
- Experience with Kubernetes, container orchestration, and hybrid/multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
- Deep understanding of security concepts and protocols, including authentication, authorization,…

