Infrastructure Engineer
Listed on 2026-05-30
IT/Tech
Systems Engineer, Cloud Computing, IT Infrastructure, SRE/Site Reliability
Nscale is the GPU cloud engineered for AI. We provide cost‑effective, high‑performance infrastructure for AI start‑ups and large enterprise customers. Nscale enables AI‑focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
About the Role
We’re hiring an Infrastructure Engineer to design, implement, operate, and continuously improve the infrastructure platforms that support both internal and customer‑facing services at Nscale.
This role sits within the Operational Engineering team in Engineering, where you’ll work across the infrastructure stack below the hypervisor with a strong focus on OpenStack, storage systems, Proxmox, DNS, DHCP, and infrastructure automation. You’ll collaborate closely with internal teams to ensure infrastructure meets performance, availability, and security requirements, while also serving as a 3rd/4th line escalation point for complex issues.
Your work will directly support the reliability, scalability, automation, and security of the platforms that power Nscale’s GPU cloud. This is a high‑impact role for someone who wants to shape core infrastructure, improve operational excellence, and bring deep technical expertise to both delivery and ongoing evolution of critical systems.
What you'll be doing
Infrastructure Design & Operations
- Design scalable and resilient infrastructure platforms across OpenStack, Proxmox, Ceph, and core supporting services.
- Implement infrastructure components that underpin internal and customer‑facing services.
- Operate critical infrastructure layers below the hypervisor with a focus on stability and performance.
- Maintain essential services such as DNS, DHCP, and configuration management tooling.
- Improve automation for provisioning, monitoring, patching, and recovery.
- Use infrastructure‑as‑code and configuration management tools to standardise operations.
- Drive continuous improvement across infrastructure reliability, scalability, and operational efficiency.
- Support repeatable and maintainable platform operations through automation‑first approaches.
- Act as a 3rd/4th line escalation point for complex infrastructure issues.
- Partner with support teams to resolve incidents and restore services effectively.
- Investigate root causes of infrastructure problems and contribute to long‑term fixes.
- Participate in on‑call rotations and incident response activities for critical infrastructure.
Cross‑Functional Collaboration & Technical Guidance
- Collaborate with internal teams to ensure solutions meet performance, availability, and security requirements.
- Contribute to infrastructure roadmap planning, including capacity management and performance tuning.
- Introduce new technologies that strengthen the infrastructure stack over time.
- Provide technical expertise to pre‑sales and other groups on infrastructure capabilities and best practices.
Standards, Security & Compliance
- Ensure infrastructure platforms adhere to compliance, security, and operational standards.
- Apply best practices to the operation and evolution of infrastructure services.
- Support secure and well‑governed platform delivery across the environments you own.
- Infrastructure availability and resilience
- Automation coverage for provisioning, patching, monitoring, and recovery
- Complex incident resolution and root cause remediation
- Capacity management and performance tuning effectiveness
- Strong experience deploying, managing, upgrading, and operating large OpenStack clusters
- Strong experience deploying, managing, and automating Proxmox
- Strong Python and Bash skills
- Strong troubleshooting experience with Linux and services running…