Senior Cloud Infrastructure Engineer
Listed on 2025-11-14
IT/Tech
Systems Engineer, Cloud Computing, Network Engineer, SRE/Site Reliability
Location: San Francisco, CA. On‑Site only. Must live within commuting distance of San Francisco or be willing to relocate.
Modality: On‑Site only.
Employment Type: Salaried W2 Full‑Time.
Salary Range: $/yr - $/yr
About The Company
We represent a pioneering open‑source technology company in San Francisco that is transforming the way creators interact with generative AI. The team has built a powerful, node‑based visual interface that gives artists, developers, and innovators the ability to design, control, and customize AI workflows with complete flexibility. Their platform allows users to connect modular components, build complex pipelines, and run everything locally with impressive speed and precision.
Their mission is to make generative AI open, transparent, and accessible to everyone. Built around community collaboration and creative empowerment, their tools help users experiment freely and bring their ideas to life.
The Role
In this role, you will lead the design, deployment, and maintenance of large‑scale distributed systems that power AI workloads. You will collaborate closely with core engineers to shape the company’s long‑term infrastructure vision while ensuring scalability, performance, and reliability across environments.
What You’ll Do
- Design, build, and maintain the core infrastructure that powers AI workloads at scale
- Manage and automate GPU compute clusters using tools such as Python, Kubernetes, Terraform, and Ansible
- Architect and operate systems for orchestration, observability, distributed storage, and networking
- Ensure reliability, scalability, and performance across production environments
- Collaborate closely with core engineers to design infrastructure for new features and systems
- Contribute to technical strategy and long‑term infrastructure vision
- Drive best practices for infrastructure automation, deployment, and monitoring
Qualifications
- 5+ years of experience as an Infrastructure Engineer or Site Reliability Engineer building and operating large‑scale distributed systems
- Skilled in Python and comfortable working with infrastructure‑as‑code tools such as Terraform and Ansible
- Familiar with container orchestration systems such as Kubernetes and related tooling like FluxCD, Prometheus, and Grafana
- Capable of managing high‑performance GPU environments across cloud and bare‑metal setups
- Highly adaptable, resourceful, and motivated by building things from the ground up
- Excited to work in a small, fast‑growing team where autonomy and accountability are key
- Comfortable working on‑site in a startup setting where collaboration and speed matter most
- Experience contributing to or maintaining open‑source projects
- Background working with AI infrastructure, ML pipelines, or GPU orchestration
- Strong computer science fundamentals and ability to work across different programming languages or frameworks
Seniority Level: Mid‑Senior level
Job Function: Information Technology
Industries: Human Resources Services