Senior/Principal SRE Tech Lead
Concord, Cabarrus County, North Carolina, 28027, USA
Listed on 2026-05-31
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Overview
We are Lenovo. We do what we say. We own what we do. We WOW our customers. Lenovo is a US $69 billion revenue global technology powerhouse, ranked #196 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo offers a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices, infrastructure, software, solutions, and services.
Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY). For more information and the latest news, see our Story Hub.
About Our Team:
Lenovo is building Quantum, a next‑generation hybrid AI platform that spans Windows, Android, and cloud. We are expanding the reliability engineering organization that powers Qira, Lenovo’s cross‑device Personal AI. We are looking for Senior Site Reliability Engineers (SREs) to help build and evolve the foundational reliability, observability, and operations capabilities that ensure Qira is fast, safe, and dependable for millions of users.
This role may support one of several teams within the SRE organization (e.g., Observability, Operations, or Service Reliability), depending on strengths and interests.
Qira operates with the speed, ownership, and creative latitude of a startup—yet is supported by the scale, resources, and technical depth of Lenovo. We are building new systems, tooling, and operational models from the ground up, with clarity, intention, and high engineering standards.
Location:
Open to remote work in the US. The preferred work location is Chicago, IL.
As a Senior SRE, you may be responsible for a subset of the following, depending on team placement and skill alignment:
Reliability Performance Engineering
- Improving the availability, scalability, and performance of distributed systems across device, edge, and cloud.
- Defining or refining SLIs, SLOs, and error budgets for critical services.
- Leading initiatives to remove single points of failure, improve resilience, and reduce operational risk.
- Participating in on-call rotations and contributing to incident response, triage, and post-incident reviews.
- Developing automation, runbooks, and self-healing systems to reduce alert noise and MTTR.
- Enhancing operational readiness and supporting incident prevention programs.
- Designing or improving observability systems using OpenTelemetry, Grafana, and modern signal pipelines.
- Building dashboards, analytics, and alerting that illuminate system health and AI service behavior.
- Ensuring telemetry is reliable, actionable, and tied to real-world outcomes.
- Improving reliability of CI/CD workflows, including phased rollouts, canaries, shadow testing, and safe rollback mechanisms.
- Contributing to the evolution of deployment tooling for device, edge, and cloud hybrid systems.
- Influencing architectural decisions by injecting reliability, observability, and operational considerations early in design.
- Collaborating with AI/ML engineers, platform engineers, firmware teams, and product partners to deliver robust, dependable user experiences.
- 10+ years of experience in Site Reliability Engineering, Production Engineering, DevOps, or large-scale distributed systems operations
- Bachelor’s Degree in Computer Science, Engineering, or a related technical discipline
- Strong experience running production distributed systems at scale
- Proficiency in at least one modern programming language (e.g., Python, Go, Java, C++)
- Strong understanding of Linux systems, networking fundamentals, and system performance tuning
- Experience with monitoring/observability (metrics, logs, tracing)
- Hands-on experience with cloud environments (Azure, AWS, or GCP)
- Experience in incident management, on-call rotations, and postmortem processes
- Deep experience with Azure cloud services
- Experience with OpenTelemetry for end-to-end instrumentation
- Familiarity with Grafana, Prometheus, Loki, Tempo, or similar tools
- Experience supporting AI/ML systems, model serving, or…