DevOps Engineer
Listed on 2026-02-24
IT/Tech
Systems Engineer, Cloud Computing
About Us
webAI is pioneering the future of artificial intelligence by establishing the first distributed AI infrastructure dedicated to personalized AI. We recognize the evolving demands of a data-driven society for scalability and flexibility, and we firmly believe that the future of AI lies in distributed processing at the edge, bringing computation closer to the source of data generation.
Our mission is to build a future where a company's valuable data and intellectual property remain entirely private, enabling the deployment of large-scale AI models directly on standard consumer hardware without compromising the information embedded within those models. We are developing an end-to-end platform that is secure, scalable, and fully under the control of our users, empowering enterprises with AI that understands their unique business.
We are a team driven by truth, ownership, tenacity, and humility, and we seek individuals who resonate with these core values and are passionate about shaping the next generation of AI.
We are seeking a Staff DevOps Engineer to architect, build, and scale secure infrastructure for deploying AI workloads across cloud and edge environments. This is a high-impact, staff-level individual contributor role where you will drive infrastructure strategy, lead technical initiatives, and serve as the subject matter expert on cloud architecture, security best practices, and platform reliability.
You will design scalable, automated infrastructure solutions that enable our AI platform to operate efficiently across diverse deployment scenarios—from public cloud to on-premises and edge computing environments. This role requires deep technical expertise, architectural thinking, and the ability to translate complex requirements into production-ready infrastructure automation.
Responsibilities
- Design and architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
- Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
- Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
- Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
- Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
- Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
- Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
- Lead incident response and reliability initiatives, participate in on-call rotation, conduct post-mortems, and drive continuous improvement for system reliability
- Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
- Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
- Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
- Drive technical documentation and knowledge sharing, including runbooks, architecture decision records (ADRs), and infrastructure standards
Qualifications
- 7+ years of hands‑on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering with a proven track record of architecting production systems
- Expert‑level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud‑native technologies in production environments
- 5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large‑scale (50+) cloud resources
- D…