System Performance Engineer for AI and HPC. Menlo Park Move Collective Job. Menlo Park area, California, USA. IT/Tech

Summary

Meta is at the forefront of building cutting‑edge AI and high‑performance computing infrastructure, enabling groundbreaking AI research and innovative products. We are seeking a passionate System Performance Engineer for AI and HPC to join our Network Infrastructure Engineering team. In this key role, you will enhance performance characterization, identify bottlenecks, and optimize large‑scale AI training and inference clusters. Work alongside experts in network fabric design and distributed computing to ensure our HPC systems achieve peak performance for advanced model development.

Responsibilities

Profile and benchmark AI training and inference workloads across expansive HPC clusters to detect network, compute, and memory bottlenecks.

Develop and maintain frameworks and dashboards for performance analysis, tracking key metrics such as GPU utilization, network bandwidth, latency, and collective communication effectiveness.

Investigate and resolve performance regressions in distributed AI training settings, focusing on RDMA fabrics, collective communication libraries, and job scheduling.

Collaborate with network infrastructure, hardware, and AI research teams to establish performance requirements and validate new HPC cluster configurations.

Design and conduct capacity and scalability experiments to guide network topology decisions for our AI supercomputing infrastructure.

Build automation tools to continuously monitor HPC system health, detect anomalies, and minimize response time during performance incidents.

Set service level objectives for AI cluster network performance and align cross‑functional teams on reliability and efficiency targets.

Lead technical design reviews for changes in network and system architecture impacting AI workload performance, clearly communicating trade‑offs to various stakeholders.

Mentor fellow engineers on performance methodologies, debugging techniques, and best practices in instrumentation.

Utilize AI‑assisted workflows to speed up root cause analysis, automate performance reporting, and broaden coverage across the HPC stack.

Minimum Qualifications

Experience in profiling and optimizing distributed AI or HPC workloads, with knowledge of GPU interconnects, RDMA networking, and frameworks like NCCL or MPI.

Skilled in debugging complex, non‑reproducible performance issues across multi‑layer systems, including network fabric, OS, and application layers.

Experience designing and implementing performance monitoring systems, with a focus on instrumentation, telemetry pipelines, and alerting for large‑scale infrastructure.

Proven ability to lead cross‑functional technical projects from requirements gathering to production deployment, communicating findings and performance trade‑offs effectively.

6+ years of experience in system performance engineering, network infrastructure engineering, or a related domain within large‑scale distributed computing or HPC environments.

Preferred Qualifications

Experience developing systems software in languages such as C++.

Familiarity with machine learning frameworks like PyTorch and TensorFlow.

Understanding of RDMA congestion control mechanisms on InfiniBand (IB) and RoCE networks.

Keen awareness of the latest advancements in artificial intelligence technologies.

Thorough understanding of AI training workloads and their network demands.

Demonstrated success in utilizing AI tools to optimize workflows, driving measurable impact on efficiency and quality.

Experience implementing responsible and ethical AI practices, including risk assessments and bias mitigation strategies.

Continuous personal development in AI skills, such as context engineering and agent orchestration, while staying updated with emerging technologies.

Compensation

$154,000/year to $217,000/year plus bonus, equity, and benefits.

Industry

Internet

Equal Opportunity

Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based on race, religion, color, national origin, sex (including pregnancy and related medical conditions), sexual orientation, gender identity, age, status as a protected veteran, or status as an individual with a disability, among other legally protected characteristics. We also consider qualified applicants with criminal histories, in line with applicable laws.

Meta participates in the E-Verify program where required. Additionally, we are committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you require assistance or accommodations due to a disability, please contact us.
