Infrastructure Management and Provisioning Engineer
Listed on 2026-07-03
IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, IT Infrastructure, SRE/Site Reliability
Infrastructure Provisioning and Management Engineer
At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come.
Join Roche, where every voice matters.
As an Infrastructure Provisioning and Management Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our core infrastructure management and provisioning tech stack. This role has a strong focus on driving configuration-as-code, infrastructure-as-code (IaC), and modern automated provisioning best practices across our high-performance compute (HPC) and industry-leading AI Factory.
You will own the lifecycle, deployment, and optimization of bare-metal and virtualized compute environments that power Roche's advanced computing initiatives. By treating infrastructure strictly as code and eliminating manual configurations, you will ensure our advanced clusters are highly reproducible, securely patched, and rapidly scalable to meet the evolving demands of computational science and large-scale AI workloads.
Hosting and Infrastructure (HI) provides mission-critical on-premises infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.
The Value Streams - Accelerated Compute Engineering (ACE) Team drives both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI infrastructure that supports AI and HPC use cases across Roche. The team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute, helping these infrastructure consumers meet requirements for high availability, seamless data transfer, flexibility, speed, and the rapidly changing needs of AI, and reach rapid time-to-value.
Job Responsibilities
Automated Provisioning & Cluster Orchestration
- Design, deploy, and manage large-scale automated provisioning systems for multi-node HPC and AI Factory environments.
- Own and maintain the infrastructure management and provisioning tech stack underpinning the orchestration, monitoring, and provisioning of complex GPU and CPU workloads.
- Streamline bare-metal provisioning and node imaging pipelines to ensure minimal downtime and rapid expansion capabilities.
Infrastructure-as-Code (IaC) & Configuration Governance
- Enforce a strict configuration-as-code and infrastructure-as-code mindset, replacing manual interventions with repeatable automation scripts.
- Author, review, and maintain complex Ansible playbooks and roles for configuration management, patch deployment, and compliance drift remediation.
- Establish robust CI/CD pipelines using GitLab to test, validate, and deploy infrastructure changes safely across development, staging, and production clusters.
Operating System Engineering & Lifecycle Management
- In partnership with Enterprise OS teams, standardize and manage operating system builds, with dual proficiency across HPC and AI Factory platforms.
- Utilize solutions such as Red Hat Image Builder and NVIDIA Base Command Manager to create optimized, compliant, and secure custom golden images tailored for AI and high-performance computing workloads.
- Manage OS lifecycles, including kernel tuning, automated package updates, and vulnerability management, ensuring alignment with global security standards.
Platform Reliability & Collaboration
- Implement proactive monitoring and alerting for infrastructure provisioning health, node availability, and configuration drift.
- Address and help resolve complex, systemic infrastructure failures, contributing to post-mortem analyses to continuously improve platform resilience.
Education / Experience
- Bachelor's or…