Architect - Platform Engineer
Listed on 2026-06-21
IT/Tech
AI Engineer (Applied/Software), SRE/Site Reliability, IT Infrastructure
This position is listed on behalf of a partner company, who manages all applications and next steps. Our partner is looking for an Architect - Platform Engineer based in the United States.
This is a senior-level architecture role focused on designing and scaling next-generation infrastructure for GenAI and large language model (LLM) workloads in enterprise and production environments. You will define the platform foundations that power distributed training, GPU-accelerated computing, and AI model deployment. The role blends deep systems engineering expertise with modern cloud-native architecture and requires strong fluency across Kubernetes, high-performance computing, and AI infrastructure stacks.
You will collaborate with data scientists, ML engineers, and software architects to deliver robust, scalable GenAI platforms. The environment is highly innovative, fast-paced, and centered on cutting‑edge AI transformation across industries. This role is ideal for a hands‑on architect who thrives at the intersection of infrastructure, performance engineering, and applied AI systems.
- Design, build, and optimize scalable infrastructure for GenAI and LLM workloads across multi‑GPU and distributed computing environments.
- Architect and manage high‑performance compute platforms using Slurm clusters and container orchestration systems such as Kubernetes and OpenShift.
- Lead GPU performance profiling, benchmarking, and optimization for distributed training and inference workloads.
- Enable and maintain NVIDIA GPU ecosystem components including CUDA, cuDNN, NCCL, Triton, and related tooling.
- Develop and operationalize GenAI pipelines supporting fine‑tuning, RAG architectures, multi‑modal systems, and LLMOps workflows.
- Build reusable infrastructure‑as‑code templates using tools such as Terraform and Helm to support scalable deployments.
- Collaborate with cross‑functional engineering teams to deploy AI solutions into both research and production environments.
- Drive automation, CI/CD practices, and platform reliability through modern DevOps and cloud engineering principles.
- Lead technical architecture discussions with internal and client‑facing stakeholders, providing scalable and production‑ready solutions.
- 10+ years of experience in platform engineering, infrastructure architecture, or high‑performance computing environments.
- Strong hands‑on expertise with Kubernetes and/or Red Hat OpenShift in production‑scale deployments.
- Deep knowledge of GPU computing ecosystems including CUDA, cuDNN, NCCL, Nsight, and TensorRT/Triton.
- Proven experience with Slurm‑based distributed training systems and multi‑GPU optimization.
- Strong Linux systems expertise with performance tuning and infrastructure scaling experience.
- Experience building and deploying GenAI workloads such as LLM fine‑tuning, RAG pipelines, or multimodal AI systems.
- Solid understanding of infrastructure‑as‑code tools including Terraform and Ansible.
- Experience working with cloud GPU environments (AWS, Azure, GCP, OCI) or on‑prem GPU clusters.
- Strong communication and leadership skills with experience mentoring teams and driving architecture decisions.
- Ability to work in client‑facing environments and translate technical complexity into scalable solutions.
- Competitive compensation aligned with senior‑level platform engineering roles
- Remote‑first flexibility across the United States and Canada
- Opportunity to work on cutting‑edge GenAI and LLM infrastructure at enterprise scale
- Exposure to leading cloud and AI ecosystems including major hyperscalers and GPU platforms
- Career growth within a fast‑scaling AI‑first engineering organization
- Hands‑on work with advanced technologies such as distributed training, GPU clusters, and LLM systems
- Collaborative, innovation‑driven environment with strong emphasis on learning and technical excellence
- Opportunity to work on high‑impact AI transformation projects across multiple industries