Senior Site Reliability Engineer, Supply: San Francisco area, California, USA (IT/Tech)

Senior Site Reliability Engineer, Supply

As a key member of the Supply Engineering team, you will enable the sustainable, reliable growth of Mithril’s compute supply, overseeing technical operations and managing compute partner relationships.

Responsibilities

Design, deploy, and manage scalable, secure, and highly available Kubernetes clusters in cloud and on‑premises environments.
Execute and develop Ansible playbooks for routine maintenance, load testing, and system burn‑in across the Mithril fleet.
Deploy and oversee monitoring systems such as Grafana to proactively detect issues and anomalies.
Establish and uphold service level objectives (SLOs) and service level indicators (SLIs) to gauge system reliability.
Lead or participate in incident response and root‑cause analysis.
Provide regular updates on machine operability and notify partners of disruptions to maintain availability and confidence.
Act as the primary liaison with suppliers, maintaining regular meetings to communicate requirements and address inquiries.
Coordinate cross‑functional supply‑related initiatives, ensuring stakeholders are aligned for upcoming changes or maintenance events.

Requirements

Proven experience deploying, scaling, and maintaining production‑grade Kubernetes clusters across cloud or on‑prem environments.
Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
Experience with Linux system administration and command‑line interfaces.
Ability to create technical documentation and specifications.
Proficiency in scripting and automation (Python, Bash, or similar).
Understanding of key infrastructure metrics (CPU, memory, network utilization, error rates).
Knowledge of data center operations: disaster recovery, maintenance schedules, capacity planning.
Strong written and verbal communication skills, with the ability to translate technical concepts for non‑technical audiences.
Project management experience and ability to handle multiple priorities.
Demonstrated problem‑solving and analytical thinking skills.
Experience leading or participating in incident response and root‑cause analysis.

Nice to Have

Familiarity with GPU/CPU cluster management and optimization.
Proficiency with Git or similar version control.
Experience with Prometheus or Grafana monitoring and observability tools.
Experience in technical training or presenting content.
Prior SRE experience in the AI/ML domain.
Experience with at‑scale infrastructure and hardware lifecycle management (RMA).
Experience in vendor‑facing roles.

Benefits

Health, dental, and vision coverage for you and dependents.
401(k) plan with 4% company match.
21 days of PTO and 14 company holidays, including 2 floating holidays.

Salary Range Information

Remuneration bracket: $170,000‑$230,000, with possible adjustments for outstanding qualifications.

In‑Office Requirement

Primary work location is Palo Alto or San Francisco, with weekly on‑site collaboration. Flexible arrangements are possible for extenuating circumstances.

Equal Opportunity Employer

Mithril maintains a strict commitment to equal opportunity employment practices. All applicants are evaluated without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation, disability, veteran status, citizenship, or any other protected class.
