Platform Engineer Job Bangalore area,Bengaluru Karnataka India,IT/Tech

Location: Bengaluru

We are Cirrus Labs. Our vision is to become the world's most sought-after niche digital transformation company that helps customers realize value through innovation. Our mission is to co-create success with our customers, partners and community. Our goal is to enable employees to dream, grow and make things happen. We are committed to excellence. We are a dependable partner organization that delivers on commitments.

We strive to maintain integrity with our employees and customers. Every action we take is driven by value. Our well-knit teams and employees are the core of who we are. You are the core of a values-driven organization.

You have an entrepreneurial spirit. You enjoy working as a part of well-knit teams. You value the team over the individual. You welcome diversity at work and within the greater community. You aren't afraid to take risks. You appreciate a growth path with your leadership team that maps out how you can grow inside and outside of the organization. You thrive on company-sponsored continuing education programs that strengthen your skills and help you become a thought leader ahead of the industry curve.

You are excited about creating change because your skills can help the greater good of every customer, industry and community. We are hiring a talented Senior AI/HPC Platform Engineer (Fabric Specialist) to join our team. If you're excited to be part of a winning team, Cirrus Labs is a great place to grow your career.

Experience - 5+ years
Location - Bengaluru/Hyderabad
Shift Time - 2 to 11 PM IST

Overview
We are seeking a hands-on AI/HPC Network Engineer to architect, build, and scale our next-generation AI Factory. In this role, you will own the critical "nervous system" of our AI platform: the high-performance network fabric.
This is a rare opportunity to work at the absolute bleeding edge of AI hardware. We are one of the first adopters deploying NVIDIA's GB300 architecture. Moving away from traditional InfiniBand, our environment utilizes a cutting-edge, all-Ethernet architecture powered by NVIDIA Spectrum-X to deliver lossless, low-latency connectivity for massive GPU and CPU clusters.
You will serve as the subject matter expert (SME) for fabric architecture, deep-dive troubleshooting, and performance tuning, ensuring our researchers and data scientists have a highly available, redundant foundation for model training and inference.
This role is not pure "Cloud" DevOps:
While we use cloud-native principles, this is a bare-metal infrastructure role. If your experience is limited to clicking buttons in AWS/Azure consoles without understanding physical topology, cabling, or switch OS internals, this is not a match.

Key Responsibilities
High-Performance Fabric Architecture
Architect and deploy NVIDIA Spectrum-X Ethernet fabrics for massive GPU clusters, designing non-blocking Leaf-Spine topologies tailored for AI Backend (East-West) traffic.
Implement and tune robust Layer 3 underlay routing (BGP) to support high-performance RoCEv2 (RDMA over Converged Ethernet) traffic.
Design and configure EVPN/VXLAN overlays to provide workload isolation, multi-tenancy, and seamless integration with Kubernetes CNIs (Calico, Multus, SR-IOV).
Fine-tune "lossless" Ethernet behavior, including Priority Flow Control (PFC), ECN (Explicit Congestion Notification), and buffer/queue management to eliminate tail latency and microbursts during collective operations (AllReduce/AllGather).
Hardware & Physical Layer Engineering
Lead the integration of NVIDIA ConnectX-8 SuperNICs and BlueField-3 DPUs, optimizing firmware settings and offload capabilities for maximum throughput.
Manage the physical connectivity lifecycle by validating and troubleshooting transceiver configurations, DAC/Client cabling, and optical budgets to ensure physical layer errors do not degrade training performance.
Maintain a deep understanding of the boundary between the Scale-Out network (Ethernet) and the Scale-Up network (NVIDIA NVLink network fabric).
Troubleshoot performance bottlenecks where network latency impacts UVM (Unified Virtual Memory) consistency or GPU-to-GPU memory…

