Site Reliability Engineer - NYC

Job in New York, New York County, New York, 10261, USA
Listing for: Mistral AI
Full Time position
Listed on 2026-06-17
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below
Location: New York

About Mistral

At Mistral AI, we believe in the power of AI to simplify tasks, save time, and enhance learning and creativity. Our technology is designed to integrate seamlessly into daily working life.

We democratize AI through high-performance, optimized, open-source and cutting-edge models, products and solutions. Our comprehensive AI platform is designed to meet enterprise needs, whether on‑premises or in cloud environments. Our offerings include le Chat, the AI assistant for life and work.

We are a dynamic, collaborative team passionate about AI and its potential to transform society.

Our diverse workforce thrives in competitive environments and is committed to driving innovation. Our teams are distributed between France, USA, UK, Germany and Singapore. We are creative, low‑ego and team‑spirited.

Role Summary

We are seeking highly experienced Site Reliability Engineers (SREs) to shape the reliability, scalability and performance of our platform and customer‑facing applications. You will work closely with our software engineers and research teams to ensure our systems meet and exceed our internal and external customers' expectations.

What you will do

As a Site Reliability Engineer, you balance the day‑to‑day operations on production systems with long‑term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.

Operations
  • Design, build, and maintain scalable, highly available and fault‑tolerant infrastructures to support our web services and ML workloads
  • Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters
  • Operate systems and troubleshoot issues in production environments (interrupts, on‑call responses, user administration, data extraction, infrastructure scaling, etc.)
  • Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
  • Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client‑facing APIs and large training runs
  • Participate occasionally in on‑call rotations to respond to incidents and perform root cause analysis to prevent future occurrences
Development
  • Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools such as Kubernetes, Flux, and Terraform
  • Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model‑training experiments
  • Build a cloud‑agnostic platform offering an abstraction layer between