Principal Lab Services Engineer, Rhondda, Wales, UK (IT/Tech)

Location: Rhondda

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

At Nscale, our Engineering team plays a critical role in driving the deployment and subsequent management of our infrastructure and software platforms.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and delivers it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, in an environment where everyone is inspired to do their best work. If you join our team, you’ll help build the technology that powers the future.

About The Role

Please note that this role does not require office presence. Instead, it requires on-site presence in one of our labs and/or datacenters.

The Principal Lab Services Engineer will provide senior technical leadership for Nscale’s lab environments, ensuring they are designed, operated, and continuously improved to support hardware development, platform validation, and cluster bring-up.

This role will act as the senior technical owner for lab capability, with a strong focus on next‑generation GPU infrastructure and the physical requirements needed to support it, including space, power, cooling, networking, and operational readiness. The Principal Lab Services Engineer will play a key role in ensuring that development environments are scalable, reliable, and fit for purpose.

The role will also be central to dogfooding our internal automation and operational tooling, helping to validate that clusters can be brought up, configured, and managed using our own platforms and workflows. This person will work closely with engineering, infrastructure, and lab operations teams to improve the efficiency, consistency, and reliability of lab‑based deployment and testing.

What You’ll Be Doing (Responsibilities)

Provide senior technical ownership of lab environments and the standards that underpin their operation.
Lead technical planning for lab capacity, including rack space, power, cooling, cabling, networking, and physical deployment requirements.
Define and maintain technical standards for lab design, hardware installation, cluster bring‑up, operational readiness, and lifecycle management.
Act as the senior technical authority for introducing and supporting next‑generation GPU systems in the lab estate.
Assess infrastructure requirements for new hardware platforms and ensure lab environments are capable of supporting high‑density compute workloads.
Work closely with platform and infrastructure teams to bring up and validate clusters using internal automation and tooling.
Dogfood internal provisioning, configuration, and cluster management tooling, providing feedback to improve reliability, usability, and operational effectiveness.
Troubleshoot complex hardware, infrastructure, and environment issues across lab systems.
Lead root cause analysis and corrective actions for major lab‑related incidents or recurring technical problems.
Produce and maintain runbooks, standards, and technical documentation for lab operations and cluster bring‑up.
Mentor engineers and help raise the technical capability of the wider Lab Services function.
Partner with facilities, vendors, and internal stakeholders to ensure infrastructure dependencies are understood and managed appropriately.
Support roadmap planning by translating future engineering and hardware requirements into technical lab capability.

About You

Deep technical understanding of hardware infrastructure, including servers, storage, networking, and GPU‑based platforms.
Strong knowledge of the physical and operational requirements for next‑generation GPU infrastructure, particularly around power, cooling, space, and resilience.
Strong hands‑on experience with cluster bring‑up, infrastructure validation, and hardware…