Principal, Software Engineer - Cloud Storage
Listed on 2025-12-08
IT/Tech
Systems Engineer, Data Engineer
Position Summary. We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role is focused on hands‑on architecture, operations, performance tuning, and troubleshooting of multi‑petabyte scale storage clusters in mission‑critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, with the ability to diagnose complex issues spanning hardware, kernel, and storage layers.
This role requires a technical leader and subject matter expert (SME) who can architect resilient storage platforms, resolve production incidents under pressure, and drive innovation in private cloud storage at scale.
THIS ROLE DOES NOT PROVIDE SPONSORSHIP
Our Private Cloud Storage Engineering team is responsible for building and operating some of the largest‑scale Ceph storage clusters in the industry, supporting mission‑critical applications across Walmart’s global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high‑performance storage for business operations, customer platforms, and innovation workloads.
The team works at the intersection of distributed storage systems, Linux internals, networking, and cloud infrastructure, solving some of the toughest technical challenges in scalability, performance, and resilience. We embrace a culture of deep technical expertise, hands‑on problem solving, and continuous learning, while driving adoption of automation, observability, and next‑generation storage technologies.
As part of this team, you will collaborate with world‑class engineers across compute, networking, security, and cloud to design end‑to‑end solutions, shape the future of enterprise storage platforms, and contribute to the broader open‑source storage community.
Key Responsibilities
Scale‑Out Distributed Storage Architecture
- Extensive experience in the design, architecture, and management of scale‑out distributed storage systems in large production environments.
- Demonstrated expertise in system performance tuning, data durability optimization (replication and/or erasure coding), and lifecycle management for petabyte‑scale data deployments.
- Proven ability to drive the evaluation, selection, and deployment of best‑of‑breed software‑defined storage (SDS) solutions that meet demanding SLAs for latency, throughput, and availability.
- Architect, deploy, and manage large‑scale clusters across multiple production sites.
- Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
- Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
- Own end‑to‑end lifecycle management of storage clusters, including OS/Kernel tuning, firmware upgrades, and hardware integration.
- Deep, hands‑on architectural experience with the design, deployment, and management of large‑scale OpenStack platforms in production environments.
- Expert‑level knowledge of core OpenStack storage services, specifically Cinder (Block Storage), Swift (Object Storage), and/or the integration of Ceph or similar distributed storage solutions.
- Experience must include data center networking design, high‑availability design, and multi‑region/multi‑site OpenStack deployments.
- Identify, diagnose, and resolve performance bottlenecks across the Ceph/scale‑out storage solution, Linux kernel, networking, and hardware layers.
- Use tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging.
- Perform deep analysis of OSD, MON, MDS, and RGW performance and optimize cluster parameters.
- Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage.
- Drive root cause analysis (RCA) for critical production issues and provide long‑term remediation.
- Build and standardize automation for cluster deployment,…