Principal Software Development Engineer - Cloud Platform Job, San Jose area, California USA, Software Development

Principal Software Development Engineer

Our Technology Team partners with teams across Expedia Group to create innovative products, services, and tools to deliver high-quality experiences for travelers, partners, and our employees. A singular technology platform powered by cloud and data provides secure, differentiated, and personalized experiences that drive loyalty and traveler satisfaction.

We are looking for a Principal Engineer to serve as the technical architect for our Cloud Platform organization, which sits within our Technology division. As a Principal Engineer reporting to the VP of Cloud Platform, you will be the primary architect of our technical future. The Cloud Platform organization provides the secure, scalable cloud infrastructure, runtime platforms, and developer experience tooling that enable teams across Expedia Group to build, deploy, and operate high-quality, resilient software quickly and safely.

In this role, you will:

Lead Architectural Evolution:
You’ll own the move toward a Cell-Based Architecture. We need to move away from fragile, monolithic clusters and toward isolated, predictable failure domains that allow us to scale horizontally with confidence.
Modernize Kubernetes & Infrastructure:
You’ll define our K8s strategy, focusing on multi-cluster management, service mesh, and automated scaling. You need to ensure our "Golden Path" makes it easy for engineers to do the right thing by default.
Harden Reliability & Observability:
You will set the standards for SRE across the org. This means moving beyond basic dashboards to causal observability, automated incident response, and rigorous SLO/SLI management. You’ll help us engineer out the root causes of systemic instability.
Optimize Cloud Economics:
You’ll lead our FinOps technical strategy. You need to build the tooling and visibility that allow us to understand cost-per-service and ensure our infrastructure spend is directly tied to business value.
Support the Developer Workflow:
While we are embracing AI tools, your job is to build the underlying "agent-friendly" infrastructure. This includes standardized Dev Containers and ephemeral environments that allow for fast, isolated iteration without clobbering shared state.

Minimum Qualifications:

Extensive professional software development experience designing, building, and operating large-scale, cloud-native distributed systems and platform services on Kubernetes.
Proven ownership of critical services or multi-service platforms, including responsibility for system design (LLD), API design, data modeling, deployment, and ongoing operational health.
Deep expertise with at least one major public cloud provider and core platform technologies (compute, networking, storage, service discovery, security, observability, and CI/CD).
Demonstrated ability to make high-impact architectural decisions, navigate complex trade-offs, and guide multiple teams toward coherent, long-term technical direction.
Familiarity with AI-driven systems, tools, or workflows and with applying AI/ML concepts to real-world products within cloud or platform environments.
Deep knowledge of observability patterns (OpenTelemetry, Prometheus, distributed tracing).
Expert-level understanding of Infrastructure as Code (Terraform, Pulumi) and CI/CD at scale.
Proficiency in Go, Rust, or similar languages used in modern platform engineering.

Preferred Qualifications:

Track record of defining and evolving multi-year technical strategies for cloud and developer platform ecosystems, and successfully driving adoption of shared platforms across many teams.
Experience designing and operating highly available, globally distributed systems at internet scale, including capacity planning, performance optimization, and robust failure handling.
Experience safely integrating and operating AI/ML-enabled solutions that improve outcomes, such as intelligent routing, predictive scaling, or automated remediation embedded in platform services, with appropriate safeguards.
Advanced experience applying AI/ML techniques to cloud and platform problems (for example, cost optimization, anomaly detection, or performance tuning) and partnering with data/ML teams to…