Senior Principal AI Agent/ML Engineer; OCI
Listed on 2026-06-19
Software Development
AI Engineer (Applied/Software), Software Architect, DevOps, Cloud Engineer - Software
- Job Identification 336161
- Job Category Product Development
- Posting Date 06/11/2026, 06:06 PM
- Job Type Regular Employee
- Does this position require a security clearance? No
- Years 10+ years
- Applicants are required to read, write, and speak the following languages English
The Senior Principal AI Agent / ML Software Engineer is a Senior Staff-level, hands-on technical leadership role responsible for defining, building, and operating next-generation AI systems on Oracle Cloud Infrastructure (OCI). This person will set architecture and engineering direction for production-grade agentic AI platforms, autonomous workflows, scalable inference infrastructure, and enterprise AI applications used in large-scale, business-critical environments.
This role requires a proven engineer who can translate ambiguous product and platform goals into durable technical strategy, lead multi-team execution without direct authority, and remain deeply hands-on in design, code, reviews, operations, and incident follow-up. The ideal candidate combines deep distributed systems experience with practical AI-native engineering, including orchestration of LLMs, tools, APIs, memory, retrieval, evaluation, guardrails, and cloud services.
The expectation is to ship, scale, and operate reliable, secure, observable, and cost-aware AI platform systems while raising the technical bar for engineers across the organization.
- Serve as a senior technical owner for OCI AI platform capabilities, including agent execution, inference systems, model serving, AI workflow orchestration, evaluation, and observability.
- Design, architect, and deliver scalable agentic AI systems capable of reasoning, planning, tool use, workflow execution, multi-step task orchestration, and safe human-in-the-loop escalation.
- Build production-grade services for tool calling, agent memory, context management, Model Context Protocol (MCP) integration, vector retrieval, multi-agent coordination, policy enforcement, and evaluation.
- Lead architecture across distributed services optimized for low latency, high throughput, GPU efficiency, reliability, cost, operability, and secure multi-tenant operation.
- Define service boundaries, APIs, data models, state management, consistency tradeoffs, failure modes, SLIs/SLOs, rollout strategies, and operational readiness criteria for AI platform services.
- Drive technical strategy across infrastructure, platform, security, data, and application engineering teams, converting broad goals into executable multi-quarter plans and measurable milestones.
- Integrate AI agents securely and reliably with enterprise APIs, cloud services, databases, identity systems, secrets management, and external systems.
- Establish Agent Ops and LLMOps practices for tracing, monitoring, eval suites, regression testing, experimentation, safety guardrails, prompt/tool versioning, and production reliability.
- Evaluate and operationalize emerging technologies in generative AI, agentic workflows, inference optimization, long-context systems, reasoning models, AI developer tooling, and agentic-first development.
- Drive engineering excellence through code reviews, design reviews, test strategy, deployment automation, incident analysis, documentation, and AI-assisted development practices using tools such as Codex, Claude Code, Cursor, Copilot, or similar systems.
- Mentor Staff and senior engineers, raise architectural standards, and influence engineering practices across OCI without requiring direct management authority.
- Own critical production outcomes, including reliability, performance, security posture, cost efficiency, and supportability for the systems delivered.
- Bachelor's, Master's, or Ph.D. in Computer Science, AI/ML, Engineering, or a related field, or equivalent practical experience.
- 12+ years of professional software engineering experience, including significant ownership of production systems; or equivalent experience demonstrating Senior Staff / Principal-level impact.
- Proven track record as a Staff, Senior Staff, Principal, or equivalent technical leader influencing architecture and execution across multiple teams.
- Deep experience designing, building, and operating high-scale distributed systems, cloud services, infrastructure platforms, or AI/ML platform services.
- Practical experience with orchestration frameworks such as LangGraph, LangChain, CrewAI, AutoGen, LlamaIndex, or similar ecosystems.
- Deep understanding of LLM application patterns, including prompt design, structured outputs, function/tool calling, context management, RAG, memory, tool safety, and evaluation.
- Strong programming skills in Python and ability to contribute high-quality production code, reviews, tests, and debugging in complex distributed environments.
- Strong expertise with Kubernetes, Docker, cloud-native infrastructure, service-to-service communication, scalability, fault tolerance, observability, and performance analysis.
- Experience defining SLIs/SLOs, production…