Principal, Software Engineer - Cloud Storage Job, Sunnyvale area, California, USA, IT/Tech

Position Summary

We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role focuses on hands‑on architecture, operations, performance tuning, and troubleshooting of multi‑petabyte scale storage clusters in mission‑critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, and the ability to diagnose complex issues spanning hardware, kernel, and storage layers.

What You'll Do

Our Private Cloud Storage Engineering team builds and operates some of the largest‑scale Ceph storage clusters in the industry, supporting mission‑critical applications across Walmart’s global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high‑performance storage for business operations, customer platforms, and innovation workloads. We embrace a culture of deep technical expertise, hands‑on problem solving, and continuous learning, while driving adoption of automation, observability, and next‑generation storage technologies.

Key Responsibilities

Scale‑Out Distributed Storage Architecture
- Design, architect, and manage scale‑out distributed storage systems in large production environments, drawing on extensive prior experience.
- Apply expertise in system performance tuning, data durability optimization (replication/erasure coding), and lifecycle management for petabyte‑scale data.
- Drive evaluation, selection, and deployment of best‑of‑breed software‑defined storage solutions to meet demanding SLAs for latency, throughput, and availability.

Ceph Storage Architecture & Operations
- Architect, deploy, and manage large‑scale Ceph clusters across multiple production sites.
- Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
- Define upgrade strategy, node rebalancing, and hardware refreshes with minimal downtime.
- Own end‑to‑end lifecycle management of storage clusters, including OS/kernel tuning, firmware upgrades, and hardware integration.

Performance, Debugging & Troubleshooting
- Identify, diagnose, and resolve performance bottlenecks across Ceph/Scale‑Out storage, Linux kernel, networking, and hardware layers.
- Use tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging.
- Perform deep analysis of OSD, MON, MDS, and RGW performance and optimize cluster parameters.
- Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage.
- Drive root cause analysis for critical production issues and provide long‑term remediation.

Automation & Observability

- Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
- Develop observability views for real‑time monitoring of IOPS, throughput, latency, and cluster health.
- Automate alerting, log analysis, and anomaly detection for proactive incident response.

Scalability & Innovation

- Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
- Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads.
- Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning.
- Evaluate next‑gen hardware (NVMe SSDs, RDMA NICs, high‑density HDDs) and their impact on storage performance.
- Benchmark and compare next‑gen server SKUs to select the most appropriate storage hardware.

Security & Compliance

- Implement encryption (at‑rest and in‑transit), access controls, and audit mechanisms for secure data management.
- Ensure compliance with enterprise and regulatory standards (e.g., PCI‑DSS, SOC, HIPAA).

Collaboration & Mentorship

- Act as the technical SME for storage within the organization, mentoring junior engineers.
- Collaborate with cross‑functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
- Partner with hardware and software…




