
Product Manager

Job in Bengaluru, 560001, Bangalore, Karnataka, India
Listing for: HPE Aruba Networking
Full Time position
Listed on 2026-02-14
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Job Description & How to Apply Below
Location: Bengaluru

Product Manager - AI Data Center Infrastructure/DCM

Job Family Definition:

We are seeking a Product Line Manager (PLM) for AI Data Center Infrastructure to define and deliver next-generation data center networking platforms for large-scale GPU clusters. This role is ideal for a visionary, hands-on leader who understands how AI workloads stress networks at scale and can translate that insight into clear product requirements and roadmaps.

The successful candidate will have deep experience with data center switching platforms, high-performance Ethernet fabrics, and GPU/NIC interconnects across NVIDIA and AMD ecosystems. You will drive the architecture and product strategy for scale-up and scale-out AI fabrics, enabling deterministic performance, ultra-low latency, and operational excellence for hyperscale AI training and inference clusters.

This role requires a self-starter and go-getter who can operate independently while collaborating across engineering, operations, and strategic partners.

What you will do:

AI Data Center & Fabric Architecture

- Define product requirements for AI data center network architectures supporting thousands of GPUs.
- Develop requirements for low-latency Ethernet fabrics using Juniper QFX platforms and Apstra-based automation.
- Enable high-bandwidth GPU and NIC interconnects optimized for large-scale distributed training and inference workloads.

GPU, NIC & Interconnect Strategy

- Lead requirements definition for next-generation GPUs, NICs, and interconnect technologies, staying ahead of industry roadmaps.
- Drive alignment with:
  - NVIDIA: ConnectX (CX7/CX8), NVLink, NVSwitch
  - AMD: MI300/MI400 platforms, Pollara NICs, Infinity Fabric
- Ensure interoperability across DAC, AEC, ACC, and optical transceivers between switches and NIC endpoints.
- Define scale-up paths using PCIe, NVLink, NVSwitch, ensuring GPU-to-GPU symmetry, consistency, and bandwidth determinism.

Switching, Routing & Telemetry

- Specify and optimize L2/L3 architectures, including EVPN-VXLAN, Class-E IPv4, and AI-optimized buffer tuning.
- Leverage hardware telemetry, streaming sensors, and analytics for proactive performance assurance.
- Drive automation using Python, Ansible, Apstra, Terraform, and related tools to enforce configuration consistency and compliance.
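The configuration-consistency goal above can be sketched in Python. This is a minimal illustration, not the team's actual tooling: device names and settings are hypothetical, and a real pipeline would pull rendered configurations through Apstra or NETCONF rather than hard-coded dicts.

```python
# Sketch of a configuration-drift check across a switch fleet.
# INTENDED models the intent-based "golden" settings; device dicts
# stand in for rendered configs fetched from each switch.

INTENDED = {"mtu": 9216, "ecn": "enabled", "pfc_priorities": [3]}

def find_drift(device_configs):
    """Return {device: {setting: (intended, actual)}} for any mismatches."""
    drift = {}
    for device, actual in device_configs.items():
        diffs = {k: (v, actual.get(k)) for k, v in INTENDED.items()
                 if actual.get(k) != v}
        if diffs:
            drift[device] = diffs
    return drift

fleet = {
    "leaf1": {"mtu": 9216, "ecn": "enabled", "pfc_priorities": [3]},
    "leaf2": {"mtu": 1500, "ecn": "enabled", "pfc_priorities": [3]},
}
print(find_drift(fleet))  # leaf2's MTU deviates from intent
```

The same drift report could feed a compliance dashboard or trigger automated remediation via Ansible.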

Performance Optimization & Troubleshooting

- Analyze GPU job performance to identify network hotspots, congestion, packet loss, and microbursts.
- Tune ECN, RDMA/RoCEv2, PFC, and traffic-engineering policies for AI workloads.
- Optimize server-to-switch interactions, including:
  - BIOS and firmware alignment
  - NIC queue and link-training parameters
  - Cable selection and management (AEC/ACC/optics)
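The hotspot and microburst analysis described above can be illustrated with a toy detector over queue-depth telemetry. The threshold, sampling interval, and sample values are hypothetical; in practice these counters would be streamed from switch hardware telemetry.

```python
# Illustrative microburst detector over per-interval queue-depth samples.
# Flags any sample that spikes well above the window's mean depth.

def find_microbursts(samples, burst_factor=4.0):
    """Return indices of samples exceeding burst_factor * mean depth."""
    mean = sum(samples) / len(samples)
    return [i for i, s in enumerate(samples) if s > burst_factor * mean]

queue_depth = [10, 12, 9, 11, 180, 10, 8, 13]  # cells per 10 us interval
print(find_microbursts(queue_depth))  # → [4]
```

A spike like the one at index 4 is the kind of transient congestion that per-second averages hide but that degrades collective-operation tail latency.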

Cross-Functional & Ecosystem Collaboration

- Partner closely with AI platform teams, GPU system architects, data center operations, and strategic vendors (NVIDIA, AMD, Juniper).
- Lead and participate in root-cause analysis for:
  - Link flaps and training failures
  - FEC and PCS errors
  - Thermal or power-related performance degradation
- Drive lab validation, scale testing, and certification of new optics, NIC firmware, and switch software releases.

What you need to bring:

- 5–10+ years of experience in data center networking, AI infrastructure, or HPC environments.
- Strong hands-on experience with Juniper QFX platforms and JunOS.
- Deep understanding of GPU architectures:
  - NVIDIA: H100/H200, GB200/GB300, NVLink/NVSwitch
  - AMD: MI300/MI400, Pollara NICs, Infinity Fabric
- Proven expertise in scale-up GPU interconnects and scale-out Ethernet fabrics.
- Strong knowledge of RDMA/RoCEv2, ECN, PFC, and buffer management.
- Familiarity with distributed AI workloads and collective communication libraries (NCCL, RCCL).
- Hands-on troubleshooting experience with high-speed optics, AEC cables, link training, and NIC firmware.
- Proficiency in automation and scripting (Python, Ansible, Bash, Terraform).
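As a back-of-envelope example of why collective operations (NCCL, RCCL) stress the fabric, a ring all-reduce has each of N ranks transmit 2·(N−1)/N times the buffer size. The buffer and cluster sizes below are illustrative, not a benchmark.

```python
# Per-GPU wire traffic for a ring all-reduce:
# reduce-scatter + all-gather each move (N-1)/N of the buffer.

def ring_allreduce_bytes_per_gpu(buffer_bytes, n_gpus):
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes

gb = 1024 ** 3
traffic = ring_allreduce_bytes_per_gpu(8 * gb, 1024)  # 8 GiB gradients, 1024 GPUs
print(f"{traffic / gb:.2f} GiB on the wire per GPU")  # → 15.98 GiB
```

At large N the per-GPU volume approaches twice the buffer size, which is why gradient synchronization saturates links and makes the ECN/PFC tuning above matter.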

Preferred Qualifications

- Certifications: JNCIE, CCIE, NCP-AII, NCA-AIIO, NCP-AIO, NCP-AIN

- Experience with Apstra or other intent-based networking platforms.
- Knowledge of 1.6T optics, 200G PAM4 SerDes, and CPO/LPO architectures.
- Experience supporting liquid-cooled GPU clusters and rack-level power/network design.
- Understanding of data center operations, observability, and SLAs for AI training and inference clusters.