
Senior Cluster SRE & Cloud Ops Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Fireworks AI
Full Time position
Listed on 2026-05-29
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer
Salary/Wage Range or Industry Benchmark: 60,000 - 80,000 USD yearly
Job Description & How to Apply Below

Requirements

  • This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success
  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, the ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
  • Willingness to participate in on-call rotations
  • (Desirable) Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and related components such as HBM and RDMA-enabled networking
  • (Desirable) Experience with machine learning infrastructure, model serving, or distributed AI frameworks
  • (Desirable) Hands-on experience in security and data protection
What the job involves
  • As a Member of Technical Staff, Cluster Management at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient
  • You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence
  • You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and ever-evolving application paradigms
  • Ensuring System Reliability:
    Ensure systems are designed and implemented for high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure.
  • Incident Management & Response:
    Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability.
  • Observability & Monitoring:
    Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
  • Automation & Toil Reduction:
    Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management.
  • Capacity Planning & Performance Tuning:
    Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization.
  • Reliability Best Practices:
    Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence.
  • On-call Rotation:
    Participate in a periodic on-call rotation to support our production environment and respond to critical alerts.
Position Requirements
10+ years of work experience