Principal Member of Technical Staff, AI Infrastructure
Listed on 2026-05-16
Software Development
Software Engineer, Cloud Engineer - Software
Austin, TX, United States
Job Category: Product Development
Posting Date: 05/13/2026
Type: Regular Employee
Security Clearance: No
Years of Experience: 6 to 10+ years
Languages: English (read, write, speak)
Oracle Cloud Infrastructure (OCI) is building the world’s largest AI clusters and leading that effort by creating a GPU-focused cloud. This role builds the systems that let customers scale from tens to thousands of GPUs while maintaining performance.
We need a highly skilled distributed systems engineer to scale and optimize AI infrastructure components such as the GPU control plane and GPU data plane, which provide computing resources for AI workloads. The engineer will provide technical leadership, bring clarity to ambiguous problems, and drive innovative solutions in collaboration with cross‑functional teams.
Responsibilities
- Design and develop scalable AI compute infrastructure, including GPU control plane and GPU data plane, to optimize customer experience and workload performance.
- Develop “best‑in‑class” AI compute infrastructure that is modular, secure, reliable, diagnosable, monitored, compliant, and reusable.
- Collaborate with development, operations, and product teams to understand requirements and design orchestration solutions.
- Mentor junior developers and promote modern software engineering practices: data- and telemetry-driven decisions, well‑defined interfaces, design reviews, coding standards, code reviews, unit and integration testing, and production monitoring.
- Develop benchmark metrics and automation to track performance and reliability across customer workloads and lower infrastructure layers.
Qualifications
- BS (or equivalent) in Computer Science, Engineering, or related field.
- 6+ years of software development experience with languages such as C, C++, C#, Java, Go, Rust.
- 3+ years designing and developing large‑scale infrastructure, distributed systems, and services.
- 1+ year of technical leadership experience, providing clarity to cross‑functional teams.
- Systematic problem‑solving, strong communication, sense of ownership, and drive.
- Ability to adapt to a fast‑paced, dynamic environment and manage multiple tasks and priorities.
- Experience managing cloud infrastructure with hundreds of thousands of servers.
- Experience with Docker and Kubernetes.
- Experience scheduling high‑performance workloads on Kubernetes or Slurm.
US:
Hiring range $96,800–$223,400 per annum. Bonus and equity may apply.
Benefits include medical, dental, vision, and disability coverage; life insurance; 401(k) with company match; paid time off; paid holidays; paid sick leave; parental leave; adoption assistance; employee stock purchase plan; financial planning; and voluntary benefits.
EEO Statement
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability, protected veteran status, or any other characteristic protected by law. Oracle will consider qualified applicants with arrest and conviction records pursuant to applicable law.