AI Infrastructure Architect
Listed on 2026-07-01
IT/Tech
Cloud Computing: Infrastructure & Operations, AI Engineer (Applied/Software), SRE/Site Reliability, IT Infrastructure
YOU ARE
As a hands‑on Infrastructure Architect, you are an experienced engineer with several years in infrastructure engineering who now takes on more complex, higher‑impact work designing and optimizing the AI and machine learning infrastructure that powers real‑world applications. Working alongside senior architects and engineers — and increasingly leading your own work streams — you apply proven skills in coding, testing, configuring, deploying, monitoring, and troubleshooting AI systems and the infrastructure they run on.
Day to day, you architect and optimize infrastructure components, write and review code and deployment scripts, design and tune cloud and on‑premises compute resources such as GPU clusters and distributed training environments, deploy AI systems and models into production, and build and optimize data pipelines that feed AI and ML workflows. You optimize the computational stack for performance, cost, power, and scalability, monitor AI systems and infrastructure health across both Infra Ops and MLOps disciplines, perform AI monitoring to track model and system performance, and independently troubleshoot and resolve complex issues across the stack.
You also mentor junior engineers, contribute to architectural decisions, and help establish best practices. This is a hands‑on, ownership‑driven role where you apply and deepen your expertise across modern tools and platforms — including container orchestration, model serving, CI/CD pipelines, Infra Ops, MLOps, and AI monitoring — while making meaningful contributions to infrastructure that enables AI‑driven business outcomes.
- Write, review, and debug code, scripts, and infrastructure‑as‑code for AI infrastructure, automation, and tooling, setting standards for quality across the team.
- Architect, configure, and provision compute resources across cloud and on‑premises environments, including GPU clusters and distributed training setups, optimizing for performance and utilization.
- Design and maintain deployment automation and CI/CD pipelines to support reliable, repeatable releases of AI systems, models, and applications.
- Deploy AI systems, models, and data pipelines into production, defining and improving the processes and best practices others follow.
- Lead container orchestration and model serving using tools such as Docker, Kubernetes, and model deployment frameworks.
- Architect and optimize the computational stack for performance, power, cost, and scalability, balancing trade‑offs against business goals.
- Evaluate and select tools, frameworks, and platforms, making recommendations that shape the infrastructure roadmap.
- Integrate AI models and systems into existing enterprise systems, ensuring interoperability, security, and regulatory compliance.
- Own AI monitoring and infrastructure health across Infra Ops and MLOps, tracking performance, reliability, and utilization, and driving remediation.
- Independently troubleshoot and resolve complex issues across the computational stack — hardware, networking, software, and models — and lead root‑cause analysis.
- Mentor junior engineers and lead code reviews, providing technical direction and supporting their growth.
- Define and document architecture standards, processes, and procedures, and apply security, cost‑efficiency, and scalability best practices across the infrastructure.
- Bachelor's degree in Computer Science, Computer Engineering, or a related engineering field.
- Practical experience coding, building, monitoring, and troubleshooting applications of AI/ML models, and selecting and designing the infrastructure for deploying and running them on premises or in the public cloud.
- Strong understanding of AI and machine learning as a subject.
- Strong understanding of computing infrastructure; knowledge of AI infrastructure is preferred.
- Proficiency in programming languages such as Python, Java, or C++.
- Experience with data pipeline and workflow management tools (e.g., Apache Airflow, Kubeflow).
- Strong problem‑solving skills and ability to work in a fast‑paced environment.
- Excellent communication and collaboration skills.
- Proven experience in AI/ML infrastructure engineering or related roles, deploying large-scale solutions on a hyperscaler platform.
We believe that no one should be discriminated against because of their differences. All employment decisions shall be made without regard to age, race, creed, color, religion, sex, national origin, ancestry, disability status, sexual orientation, gender identity or expression, marital status, citizenship status or any other basis as protected by applicable law. Our rich diversity makes us more innovative, more competitive, and more creative, which helps us better serve our clients and our communities.