ML Systems/Infrastructure Engineer

Job in Harrow, Greater London, HA2, England, UK
Listing for: EPIC Centre
Full Time position
Listed on 2026-02-15
Job specializations:
  • Software Development
    AI Engineer, Machine Learning/ML Engineer
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 GBP yearly
Job Description & How to Apply Below

London Office

Hybrid

ML Systems/Infrastructure Engineer

Oriole is seeking a talented ML Systems/Infrastructure Engineer to help co‑optimize our AI/ML software stack with cutting‑edge network hardware. You’ll be a key contributor to a high‑impact, agile team focused on integrating middleware communication libraries and modelling the performance of large‑scale AI/ML workloads.

Key Responsibilities
  • Design and optimize custom GPU communication kernels to enhance performance and scalability across multi‑node environments.
  • Develop and maintain distributed communication frameworks for large‑scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
  • Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
  • Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next‑generation network hardware and software stack.
  • Contribute to system‑level architecture decisions for large‑scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.
Required Skills & Experience
  • Proficient in C++ and Python, with a strong track record in high‑performance computing or machine learning projects.
  • Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
  • Hands‑on experience debugging GPU kernels using tools such as cuda-gdb, cuda-memcheck, and Nsight Systems, and inspecting PTX and SASS.
  • Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations.
  • Familiarity with HPC networking protocols and libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
  • Experience with distributed deep learning/MoE frameworks, including PyTorch Distributed, vLLM, or DeepEP.
  • Solid understanding of deploying and optimizing large‑scale distributed deep learning workloads in production environments, with working knowledge of Linux, Kubernetes, Slurm, OpenMPI, GPU drivers, Docker, and CI/CD automation.
About Oriole Networks

Accelerating AI in a Low Carbon World – Oriole Networks is a photonic networking company, developing disruptive technologies for AI/ML and HPC networking that will revolutionise data centres.
