Principal Software Development Engineer - Cloud Platform
Listed on 2026-05-09
Software Development
Cloud Engineer - Software, DevOps, AI Engineer, Software Engineer
Principal Software Development Engineer
Our Technology Team partners with teams across Expedia Group to create innovative products, services, and tools to deliver high-quality experiences for travelers, partners, and our employees. A singular technology platform powered by cloud and data provides secure, differentiated, and personalized experiences that drive loyalty and traveler satisfaction.
We are looking for a Principal Engineer to serve as the technical architect for our Cloud Platform organization, which sits within our Technology division. As a Principal Engineer reporting to the VP of Cloud Platform, you will be the primary architect of our technical future. The Cloud Platform organization provides the secure, scalable cloud infrastructure, runtime platforms, and developer experience tooling that enable teams across Expedia Group to build, deploy, and operate high-quality, resilient software quickly and safely.
In this role, you will:
- Lead Architectural Evolution:
You’ll own the move toward a Cell-Based Architecture. We need to move away from fragile, monolithic clusters and toward isolated, predictable failure domains that allow us to scale horizontally with confidence.
- Modernize Kubernetes & Infrastructure:
You’ll define our K8s strategy, focusing on multi-cluster management, service mesh, and automated scaling. You need to ensure our "Golden Path" makes it easy for engineers to do the right thing by default.
- Harden Reliability & Observability:
You will set the standards for SRE across the org. This means moving beyond basic dashboards to causal observability, automated incident response, and rigorous SLO/SLI management. You’ll help us engineer out the root causes of systemic instability.
- Optimize Cloud Economics:
You’ll lead our FinOps technical strategy. You need to build the tooling and visibility that allow us to understand cost per service and ensure our infrastructure spend is directly tied to business value.
- Support the Developer Workflow:
While we are embracing AI tools, your job is to build the underlying "agent-friendly" infrastructure. This includes standardized Dev Containers and ephemeral environments that allow for fast, isolated iteration without clobbering shared state.
- Extensive professional software development experience designing, building, and operating large-scale, cloud-native distributed systems and platform services on Kubernetes.
- Proven ownership of critical services or multi-service platforms, including responsibility for system design (LLD), API design, data modeling, deployment, and ongoing operational health.
- Deep expertise with at least one major public cloud provider and core platform technologies (compute, networking, storage, service discovery, security, observability, and CI/CD).
- Demonstrated ability to make high-impact architectural decisions, navigate complex trade-offs, and guide multiple teams toward coherent, long-term technical direction.
- Familiarity with AI-driven systems, tools, or workflows, and experience applying AI/ML concepts to real-world products within cloud or platform environments.
- Deep knowledge of observability patterns (OpenTelemetry, Prometheus, distributed tracing).
- Expert-level understanding of Infrastructure as Code (Terraform, Pulumi) and CI/CD at scale.
- Proficiency in Go, Rust, or similar languages used in modern platform engineering.
- Track record of defining and evolving multi-year technical strategies for cloud and developer platform ecosystems, and successfully driving adoption of shared platforms across many teams.
- Experience designing and operating highly available, globally distributed systems at internet scale, including capacity planning, performance optimization, and robust failure handling.
- Experience safely integrating and operating AI/ML-enabled solutions that improve outcomes, such as intelligent routing, predictive scaling, or automated remediation embedded in platform services, with appropriate safeguards.
- Advanced experience applying AI/ML techniques to cloud and platform problems (for example, cost optimization, anomaly detection, or performance tuning) and partnering with data/ML teams to…