Senior Technical Consultant AI Infrastructure. Carson LilyLifestyle
Listed on 2026-05-30
Software Development
Software Engineer
Job Description
Join us at OCI AI Infrastructure, where you will contribute to the development of advanced GPU platforms tailored for AI/ML/HPC workloads. This is a unique opportunity to be part of the AI revolution, enabling customers to effortlessly scale from tens to thousands of GPUs while maintaining exceptional performance.
Our dedicated team architects essential improvements in GPU delivery, health monitoring, testing, triage automation, and diagnostic services. These are crucial for distributed AI workloads that span many GPUs and use cutting-edge technologies like RoCE and InfiniBand.
As a Senior Technical Consultant, you will spearhead software design and development for significant components of Oracle's Cloud Infrastructure. We're looking for a lead developer who is a curious problem solver, well-versed in distributed systems, and possesses comprehensive Linux engineering skills with systems triage experience. You should be prepared to delve deep into diverse parts of the stack and low-level systems, designing extensive distributed system interactions.
Emphasis on simplicity and scalability is essential, along with a collaborative and agile work ethic.
In this role, you'll belong to the Compute AI Infrastructure In-Band Engineering team, responsible for the vital infrastructure that automates the testing of new platform shapes ranging from AMD to Intel to Arm and Nvidia. Our operations straddle bare-metal hardware and full-stack orchestration, making it ideal for those with expertise in both distributed systems and Linux/firmware. The team collaborates closely with various components, including OCI APIs, NICs, Smart NICs, ILOMs, and GPUs, to build high-performance, scalable services and tools that configure, test, and validate server platforms within OCI's expansive Compute and GPU Infrastructure fleet. Partnerships with other teams in Compute, Networking, Security, Data Center Engineering, and Hardware Development will be key to ensuring seamless launching, scaling, and maintenance of new server platforms with minimal operational overhead and high reliability.
You will see your work on cutting-edge GPU hardware translate directly into tangible business results.
We are dedicated to equity, inclusion, and respect for all individuals. Our commitment extends to our products and actions, and we continuously seek personal and professional growth. You will be part of a dynamic team filled with motivated and diverse individuals in a flexible work environment where your contributions are valued. If you have a passion for building large-scale distributed infrastructure in the cloud, enjoy working with the latest GPU technology, and possess a knack for distributed systems and Linux development, we invite you to apply!
Responsibilities
Minimum Qualifications
BS or MS degree in Computer Science or a related technical field with coding experience, or equivalent practical experience.
In-depth knowledge of operating systems, computer networks, and high-performance applications.
Over 6 years of experience in delivering and operating large-scale production systems (thousands of server instances).
Proficiency in multiple programming languages (Java, Python, C, C++, Go, shell scripting).
A systematic approach to problem-solving with strong communication skills and a sense of ownership.
Proven ability to successfully deliver products and experience with the full software development lifecycle.
Solid background in Linux systems.
Familiarity with system-level architecture, data synchronization, fault tolerance, and state management.
General experience with enterprise storage, networking, or computing.
Experience with server/GPU hardware architecture and system management.
Knowledge of InfiniBand or RoCE networking technologies.
Hands-on experience designing, developing, and operating public cloud service data planes.