Infrastructure Site Reliability Engineer
Listed on 2026-05-30
IT/Tech
Systems Engineer, Cloud Computing, IT Support, Network Engineer
About Radiant
Radiant is redefining how AI infrastructure is built.
We design and operate AI-native cloud platforms engineered for sovereignty, performance, and scale. Our infrastructure powers GPU-native workloads, multi-tenant control planes, and high-performance AI systems designed for the most demanding environments.
We are not building a generic cloud. We are building purpose-built AI infrastructure - from powered land, to compute, to software.
As we scale our platform and expand our engineering organisation, we are looking for leaders who can build strong teams, uphold high standards, and deliver reliably at pace.
Job Summary:
We’re looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You’ll contribute across the bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7/365, while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.
What You’ll Do:
- Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
- Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
- Configure, monitor, and manage bare-metal infrastructure using IPMI, Redfish, etc.
- Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
- Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement
- Maintain and enhance our observability stack: Prometheus, Grafana, and custom monitoring integrations
- Operate and support services in 24x7 production environments, including on-call rotation
- Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
- Mentor junior engineers and act as an Operational requirements consultant to other departments
- Communicate technical decisions clearly to non-technical stakeholders and customers
- Uphold a culture of: do, document, automate
- Willingness to cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
- Willingness to cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
- 5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
- Expert-level Linux administration, especially Ubuntu distributions
- Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
- Familiarity with out-of-band management and provisioning tools (IPMI, Redfish, PXE, etc.)
- Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
- Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
- Deep understanding of observability principles and tools (Prometheus, Grafana)
- Hands-on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
- Strong grasp of ITSM and service operation best practices
- Excellent communication and mentorship skills
- Comfortable interfacing with internal stakeholders and external customers
- Bonus: Knowledge of HPC workloads and GPU-based infrastructure
- Bonus: Experience with InfiniBand networks and HPC performance tuning
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
- LPIC certifications
- ITIL Foundation level qualification or equivalent experience
- You approach problems with a systems mindset - balancing practical execution with long‑term scalability
- You elevate the team, setting high standards for technical quality and engineering excellence.
- You hold yourself and others accountable - giving direct feedback and expecting the same
- You take initiative, owning challenges end-to-end and proactively driving solutions.
- You invest in others, mentoring to build both capability and confidence.
- You communicate clearly - translating complexity into clarity across engineering and business audiences
What sets us apart is…