HPC consultant Job Fremont area,California USA,IT/Tech

Role- HPC consultant
Location
- Fremont, CA/ Tualatin, OR
Contract

HPC Cluster & Scheduler Management

• Design, configure, tune, and optimize SLURM partitions, queues, QoS, and scheduling policies to maximize cluster utilization and workload efficiency.

• Perform in-depth analysis of job scheduling behavior, bottlenecks, and resource contention.

• Troubleshoot job failures, performance degradation, and scheduler-related issues in production HPC environments.

• Implement fair-share, backfill, reservations, and policy-driven scheduling as required.
Storage Benchmarking & Procurement Support

• Lead HPC storage performance benchmarking using industry-standard tools (e.g., IOR, FIO, MDTest, IOzone).

• Analyze I/O patterns of HPC workloads and map them to appropriate storage architectures (parallel file systems, NVMe, Lustre, Spectrum Scale, etc.).

• Provide technical input for storage selection and procurement, including performance expectations, sizing, and cost-performance tradeoffs.

• Collaborate with vendors and internal teams during POCs and performance validation exercises.
HPC Application Build & Optimization

• Build, install, configure, and maintain HPC applications, compilers, libraries, and scientific software stacks.

• Optimize application performance using MPI, OpenMP, GPU acceleration (where applicable), and tuned math libraries.

• Support multiple compiler tool chains (GCC, Intel, LLVM, NVIDIA HPC SDK, etc.).

• Implement and manage environment modules (Lmod) or similar software management frameworks.
System Performance & Operations

• Conduct system-level performance tuning across compute, memory, network, and storage layers.

• Diagnose node-level issues involving CPU, GPU, interconnects (Infini Band/Ethernet), and OS configurations.

• Create operational runbooks, performance baselines, and troubleshooting documentation.

• Support cluster upgrades, expansions, and hardware refresh activities.
Collaboration & Delivery

• Work closely with application owners, researchers, and infrastructure teams to meet aggressive delivery timelines.

• Translate workload requirements into practical HPC configurations and optimizations.

• Provide clear technical guidance and recommendations to leadership and stakeholders.
Required Skills & Experience
Core HPC Skills

• 8-12+ years of hands-on HPC engineering experience in production environments.

• Strong expertise with SLURM (configuration, tuning, troubleshooting).

• Solid understanding of Linux systems (RHEL/CentOS/Rocky/Alma preferred).

• Deep knowledge of HPC storage systems and I/O performance analysis.

• Proven experience building and optimizing HPC applications and libraries.
Technical Proficiency

• MPI implementations (Open MPI, MPICH), OpenMP

• Compilers and tool chains (GCC, Intel, NVIDIA HPC SDK)

• Performance tools (perf, vtune, nvprof/nsys, IB diagnostics)

• Environment modules (Lmod), package managers (Spack preferred)

• Bash/Python scripting for automation and diagnostics

Nice to Have

• Experience with GPU-based HPC workloads (NVIDIA CUDA, ROCm).

• Exposure to cloud-based HPC (Azure, AWS, GCP).

• Familiarity with parallel file systems such as Lustre or IBM Spectrum Scale.

• Vendor engagement experience for HPC hardware/storage evaluations.