Senior Site Reliability Engineer, Supply
Listed on 2026-01-03
IT/Tech
Systems Engineer, Cloud Computing
As a key member of the Supply Engineering team, you will enable the sustainable, reliable growth of Mithril’s compute supply, overseeing technical operations and managing compute partner relationships.
Responsibilities
- Design, deploy, and manage scalable, secure, and highly available Kubernetes clusters in cloud and on‑premises environments.
- Execute and develop Ansible playbooks for routine maintenance, load testing, and system burn‑in across the Mithril fleet.
- Deploy and oversee monitoring systems such as Grafana to proactively detect issues and anomalies.
- Establish and uphold service level objectives (SLOs) and service level indicators (SLIs) to gauge system reliability.
- Lead or participate in incident response and root‑cause analysis.
- Provide regular updates on machine operability and notify partners of disruptions to maintain availability and confidence.
- Act as the primary liaison with suppliers, maintaining regular meetings to communicate requirements and address inquiries.
- Coordinate cross‑functional supply‑related initiatives, ensuring stakeholders are aligned for upcoming changes or maintenance events.
Qualifications
- Proven experience deploying, scaling, and maintaining production‑grade Kubernetes clusters across cloud or on‑prem environments.
- Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
- Experience with Linux system administration and command‑line interfaces.
- Ability to create technical documentation and specifications.
- Proficiency in scripting and automation (Python, Bash, or similar).
- Understanding of key infrastructure metrics (CPU, memory, network utilization, error rates).
- Knowledge of data center operations: disaster recovery, maintenance schedules, capacity planning.
- Strong written and verbal communication skills, able to translate technical concepts.
- Project management experience and ability to handle multiple priorities.
- Demonstrated problem‑solving and analytical thinking skills.
- Experience leading or participating in incident response and root‑cause analysis.
- Familiarity with GPU/CPU cluster management and optimization.
- Proficiency with Git or similar version control.
- Experience with Prometheus or Grafana monitoring and observability tools.
- Experience in technical training or presenting content.
- Prior SRE experience in the AI/ML domain.
- Experience with at-scale infrastructure and hardware lifecycle management (RMA).
- Experience in vendor‑facing roles.
Benefits
- Health, dental, and vision coverage for you and dependents.
- 401k plan with 4% company match.
- 21 days of PTO and 14 company holidays, including 2 floating holidays.
Remuneration bracket: $170,000‑$230,000, with possible adjustments for outstanding qualifications.
In‑Office Requirement
Primary work location is Palo Alto or San Francisco, with weekly on‑site collaboration. Flexible arrangements are possible for extenuating circumstances.
Equal Opportunity Employer
Mithril maintains a strict commitment to equal opportunity employment practices. All applicants are evaluated without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation, disability, veteran status, citizenship, or any other protected class.