Principal Performance Engineer Job, Wayne area, Pennsylvania, USA, Manufacturing / Production

Cornelis Networks delivers the world’s highest-performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software, and system-level technologies to maximize the efficiency of GPU, CPU, and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI and HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance, and scalability, solving the world’s most demanding computational challenges with our next-generation networking solutions.

We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.

We’re seeking a Principal Performance Engineer to drive end-to-end performance for next-generation networking silicon and systems (adapters, switches, software). You will help set the performance strategy, lead investigations across layers (from switch/silicon through drivers to AI/HPC workloads), and enable large-scale customer deployments across multiple verticals (cloud, autonomous, aerospace/defense, manufacturing, life sciences, climate). You’ll partner directly with architecture, firmware, software, and lighthouse customers to raise the performance ceiling.

This is a high-impact, highly visible individual-contributor role with technical leadership scope (mentoring, cross-functional influence).

Key Responsibilities:

Own pre- and post-launch performance: plan, execute, and sustain performance validation, debugging, and optimization for adapters, switches, and fabric software, first in the lab, then at scale in production.
Lead performance for post-silicon bring-up validation of networking ASICs and end products (adapters, switches, etc.), driving optimization and characterization against networking metrics and application performance.
Deliver white-glove customer support at scale: reproduce field issues, co-debug in shared/onsite labs, land mitigations and durable fixes, and publish per-customer tuning guides, with the opportunity to grow into a customer performance support lead while remaining an IC.
Pathfind and optimize forward-looking workloads: drive research and enablement for AI inference (QPS, P99/P99.9, cost/throughput), distributed AI training (NCCL/RCCL collectives), and traditional HPC (manufacturing, life sciences, climate).
Multi-fabric research & enablement: evaluate and tune Cornelis/Omni-Path, Ethernet/RoCEv2, and InfiniBand across topologies (Clos/fat-tree/dragonfly), routing (ECMP/adaptive), and congestion control (credit, PFC/ECN/DCQCN).
Design credible experiments: synthesize representative traffic, replay workload traces, and run on-cluster A/B tests with statistically sound comparisons (P50/P90/P99).

Required Qualifications:

10+ years in performance engineering, post-silicon/perf validation, or systems performance for high-speed networking or HPC/AI products.
Post-silicon expertise: hands‑on bring‑up and performance validation of networking ASICs/systems (adapters, switches), including crafting validation plans, establishing pass/fail criteria, correlating pre‑silicon models to silicon, and driving fixes from first silicon through production.
Demonstrated depth in networking hardware (switch/silicon) and software debug for performance tuning and issue resolution across production‑scale deployments.
Hands‑on multi‑fabric experience: Cornelis/Omni‑Path, Ethernet/RoCEv2, and/or InfiniBand; strong grasp of PCIe/GPU‑Direct, queueing/QoS, and congestion control (credit, PFC, ECN, DCQCN).
AI/HPC workload fluency: NCCL/RCCL collectives, UCX/libfabric/MPI; ability to optimize end‑to‑end training and inference (throughput, QPS, tail latency, efficiency) on real clusters.
Experimentation & analysis: workload modeling, on‑cluster A/B tests, tail‑latency analysis (P50/P90/P99); ability to separate congestion from compute/IO…

