Principal, Software Engineer - Cloud Storage

Job in Sunnyvale, Santa Clara County, California, 94087, USA
Listing for: jobr.pro
Full Time position
Listed on 2026-01-08
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 125,000 - 150,000 USD Yearly
Job Description & How to Apply Below

Position Summary

We are seeking a highly skilled Principal Engineer (Ceph/Scale-Out Storage) with 10+ years of deep technical experience in distributed storage systems. This role focuses on hands‑on architecture, operations, performance tuning, and troubleshooting of multi‑petabyte scale storage clusters in mission‑critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, and the ability to diagnose complex issues spanning hardware, kernel, and storage layers.

What You'll Do

Our Private Cloud Storage Engineering team builds and operates some of the largest‑scale Ceph storage clusters in the industry, supporting mission‑critical applications across Walmart’s global ecosystem. With hundreds of PB of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high‑performance storage for business operations, customer platforms, and innovation workloads. We embrace a culture of deep technical expertise, hands‑on problem solving, and continuous learning, while driving adoption of automation, observability, and next‑generation storage technologies.

Key Responsibilities
  • Scale‑Out Distributed Storage Architecture
    • Design, architect, and manage scale‑out distributed storage systems in large production environments.
    • Tune system performance, optimize data durability (replication/erasure coding), and manage the lifecycle of petabyte‑scale data.
    • Drive evaluation, selection, and deployment of best‑of‑breed software‑defined storage solutions to meet demanding SLAs for latency, throughput, and availability.
  • Ceph Storage Architecture & Operations
    • Architect, deploy, and manage large‑scale Ceph clusters across multiple production sites.
    • Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
    • Define upgrade strategy, node rebalancing, and hardware refreshes with minimal downtime.
    • Own end‑to‑end lifecycle management of storage clusters, including OS/kernel tuning, firmware upgrades, and hardware integration.
  • Performance, Debugging & Troubleshooting
    • Identify, diagnose, and resolve performance bottlenecks across Ceph/Scale‑Out storage, Linux kernel, networking, and hardware layers.
    • Use tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging.
    • Perform deep analysis of OSD, MON, MDS, RGW performance and optimize cluster parameters.
    • Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage.
    • Drive root cause analysis for critical production issues and provide long‑term remediation.
  • Automation & Observability
    • Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
    • Develop observability views for real‑time monitoring of IOPS, throughput, latency, and cluster health.
    • Automate alerting, log analysis, and anomaly detection for proactive incident response.
  • Scalability & Innovation
    • Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
    • Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads.
    • Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning.
    • Evaluate next‑gen hardware (NVMe SSDs, RDMA NICs, high‑density HDDs) and their impact on storage performance.
    • Benchmark and compare next‑gen server SKUs to select the most appropriate storage hardware.
  • Security & Compliance
    • Implement encryption (at‑rest and in‑transit), access controls, and audit mechanisms for secure data management.
    • Ensure compliance with enterprise and regulatory standards (e.g., PCI‑DSS, SOC, HIPAA).
  • Collaboration & Mentorship
    • Act as technical SME for storage within the organization, mentoring junior engineers.
    • Collaborate with cross‑functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
    • Partner with hardware and software…