Principal, Software Engineer - Cloud Storage
Listed on 2026-01-08
IT/Tech
Systems Engineer, Data Engineer
Position Summary
We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role focuses on hands‑on architecture, operations, performance tuning, and troubleshooting of multi‑petabyte scale storage clusters in mission‑critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, and the ability to diagnose complex issues spanning hardware, kernel, and storage layers.
What You'll Do
Our Private Cloud Storage Engineering team builds and operates some of the largest‑scale Ceph storage clusters in the industry, supporting mission‑critical applications across Walmart’s global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high‑performance storage for business operations, customer platforms, and innovation workloads. We embrace a culture of deep technical expertise, hands‑on problem solving, and continuous learning, while driving adoption of automation, observability, and next‑generation storage technologies.
Key Responsibilities
- Scale‑Out Distributed Storage Architecture
- Extensive experience designing, architecting, and managing scale‑out distributed storage systems in large production environments.
- Expertise in system performance tuning, data durability optimization (replication/erasure coding), and lifecycle management for petabyte‑scale data.
- Drive evaluation, selection, and deployment of best‑of‑breed software‑defined storage solutions to meet demanding SLAs for latency, throughput, and availability.
- Ceph Storage Architecture & Operations
- Architect, deploy, and manage large‑scale Ceph clusters across multiple production sites.
- Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
- Define upgrade strategy, node rebalancing, and hardware refreshes with minimal downtime.
- Own end‑to‑end lifecycle management of storage clusters, including OS/kernel tuning, firmware upgrades, and hardware integration.
- Performance, Debugging & Troubleshooting
- Identify, diagnose, and resolve performance bottlenecks across Ceph/Scale‑Out storage, Linux kernel, networking, and hardware layers.
- Use tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging.
- Perform deep analysis of OSD, MON, MDS, RGW performance and optimize cluster parameters.
- Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage.
- Drive root cause analysis for critical production issues and provide long‑term remediation.
- Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
- Develop observability views for real‑time monitoring of IOPS, throughput, latency, and cluster health.
- Automate alerting, log analysis, and anomaly detection for proactive incident response.
- Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
- Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads.
- Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning.
- Evaluate next‑gen hardware (NVMe SSDs, RDMA NICs, high‑density HDDs) and their impact on storage performance.
- Benchmark and compare next‑gen server SKUs to select the most appropriate storage hardware.
- Implement encryption (at‑rest and in‑transit), access controls, and audit mechanisms for secure data management.
- Ensure compliance with enterprise and regulatory standards (e.g., PCI‑DSS, SOC, HIPAA).
- Act as technical SME for storage within the organization, mentoring junior engineers.
- Collaborate with cross‑functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
- Partner with hardware and software…