About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
Job Overview:
At T-Mobile, we don’t just build technology — we empower people. We believe in investing in YOU — your growth, your impact, and your future. We’re unstoppable when individuals like you come together to solve bold challenges, inspire innovation, and build platforms that serve millions.
As a Principal Site Reliability Engineer, you’ll join a world-class engineering team focused on building and scaling intelligent infrastructure for LLM-based applications, AI services, and enterprise-scale backend systems. You’ll contribute to the design and implementation of observability, automation, and incident response strategies that ensure our platforms are high-performing, reliable, and cost-effective. You’ll play a key role in driving operational excellence, supporting platform scalability, and collaborating across engineering and architecture teams.
This role provides growth opportunities to influence large-scale architecture and AI/ML reliability.
Key Responsibilities:
- Design, develop and maintain observability, monitoring, and alerting systems for AI platforms and mission-critical backend services.
- Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
- Define and maintain SLOs, SLIs, and real-time health indicators across platform services and APIs.
- Participate in on-call rotations and lead the resolution of high-impact incidents, including root cause analysis and postmortem reporting.
- Collaborate with platform engineering teams to enforce governance, compliance, and security standards in production environments.
- Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g., GitLab).
- Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ, databases, and distributed APIs.
- Support capacity planning, cost analysis, and system tuning to improve platform performance.
- Advocate for automation-first operations, reducing manual toil through scripting and reliability tooling.
- Create and maintain documentation, runbooks, and knowledge-sharing resources across SRE and engineering teams.
- Mentor junior engineers and foster a culture of technical rigor and continuous improvement.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred).
- 10+ years of experience in SRE, DevOps, or operations engineering in cloud-based environments, with 15+ years of overall experience in technology.
- Hands-on experience with monitoring, alerting, and incident response in distributed systems.
- Strong coding and scripting skills in Python, Java, or shell scripting languages such as Bash or PowerShell.
- Solid understanding of database principles and experience with distributed storage solutions such as Oracle, Cassandra, SOLR, and Kafka.
- Proficiency in CI/CD pipelines and GitLab workflows.
- Strong working knowledge of SQL and NoSQL databases, including Oracle and Cassandra.
- Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and troubleshooting large-scale environments.
- Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
- Expertise in observability tools such as Splunk, Grafana, and Prometheus.
- Experience with Kubernetes, container orchestration, and hybrid/multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
- Deep understanding of security concepts and protocols, including authentication, authorization,…