Engineer, SRE GenAI
Listed on 2025-12-08
IT/Tech
Cloud Computing, Systems Engineer
At T-Mobile, we invest in YOU! Our Total Rewards Package ensures that employees get the same big love we give our customers. All team members receive a competitive base salary and compensation package - this is Total Rewards. Employees enjoy multiple wealth-building opportunities through our annual stock grant, employee stock purchase plan, 401(k), and access to free, year-round money coaches. That's how we're UNSTOPPABLE for our employees!
Job Overview
As an Engineer in Site Reliability Engineering (SRE) for AI Systems, you will help ensure the reliability, scalability, and performance of AI platforms. This role includes participating in on-call rotations, improving system observability, and supporting operations across cloud-native infrastructure.
This is a hands‑on role ideal for someone with foundational SRE skills and a growth mindset who wants to expand into GenAI and LLM infrastructure operations.
We pride ourselves on encouraging a culture of innovation, advocating for agile methodologies, and promoting transparency in all that we do. Join us in embodying the spirit of the 'Un-carrier' and make a tangible impact! Our team is dynamic, and no two days are the same; we are diverse, inclusive, and passionate about growth and transformation. If you're up to the challenge, apply today!
Job Responsibilities
- Participate in on‑call rotations to support AI platforms and respond to production incidents with urgency and precision.
- Monitor system health and performance using tools like Grafana, Splunk, and Power BI.
- Support cloud‑native infrastructure deployments, with a focus on Azure (primary), and exposure to AWS or GCP.
- Implement runbooks and automate repetitive operational tasks to reduce toil.
- Support CI/CD pipelines and IaC deployments using GitLab pipelines and Databricks.
- Assist in the development and enforcement of Service Level Objectives (SLOs) and real‑time alerts for AI APIs and services.
- Collaborate with senior engineers to improve platform reliability and scale LLM‑based applications.
- Bachelor's degree in Computer Science, Engineering, or a related field (Required)
- 2-4 years of experience in DevOps, SRE, or cloud platform engineering.
- Hands‑on experience with monitoring/logging systems such as Prometheus, Grafana, Splunk, or OpenSearch.
- Familiarity with cloud environments (preferably Azure; AWS/GCP a plus).
- Experience in scripting or automation using Python, Bash, or PowerShell.
- Basic understanding of containerization (Docker, Kubernetes) and CI/CD concepts.
- Willingness to participate in an on‑call schedule and incident resolution.
- Strong problem‑solving and root cause analysis skills.
- Exposure to AI/ML infrastructure or LLM‑based systems (e.g., OpenAI, ChatGPT, Azure OpenAI).
- Experience with infrastructure‑as‑code tools like Terraform or ARM templates.
- Familiarity with LLM observability or API token usage metrics.
- Passion for learning AI reliability practices and collaborating with cross‑functional teams.
Skills and Abilities
- Communication (Required)
- Customer Service (Required)
- Analytics (Required)
- Technical Writing (Required)
- At least 18 years of age
- Legally authorized to work in the United States
Travel Required (Yes/No):
Yes
DOT Regulated Position (Yes/No):
No
Safety Sensitive Position (Yes/No):
No
Base Pay Range: $92,500 - $166,800
Corporate Bonus Target: 15%
Base Pay Range above is the general base pay range for a successful candidate in the role. The successful candidate's actual pay will be based on various factors, such as work location, qualifications, and experience, so the actual starting pay will vary within this range.
Benefits
At T-Mobile, employees in regular, non‑temporary roles are eligible for an annual bonus or periodic sales incentive or bonus, based on their role. Most Corporate employees are eligible for a year‑end bonus based on company and/or individual performance, set at a percentage of the employee's eligible earnings in the prior year. Certain positions in Customer Care are eligible for monthly bonuses based on individual and/or team performance.