Senior Cloud Engineer, Foster City area, California, USA (IT/Tech)

Top Skills Required for this role:

HPC – High performance computing
AWS cloud services
DevOps CI/CD
Python

Scope of Work

HPC Cluster Deployment:

Automate the deployment of HPC clusters through CI/CD pipelines, using GitHub pipelines and AWS Systems Manager.

Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.

Set up and configure HPC clusters to meet specific requirements and workloads.

Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.

Conduct regression testing to verify the functionality and performance of non‑GxP HPC clusters.

Workload Scheduler Management:

Install and configure workload managers and schedulers such as LSF, Slurm, and PBS Pro.

Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.

Develop and manage resource policies and rules to optimize cluster performance.

Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.

Address and resolve issues related to schedulers, daemons, and license servers.

Network and High‑Performance Connectivity Management:

Install and configure HPC interconnect networks.

Design and configure the network topology for HPC clusters.

Maintain and monitor InfiniBand connectivity.

Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.

Monitoring and Reports:

Produce daily health check reports for the HPC cluster.

Automate monitoring scripts to streamline the monitoring process.

Conduct periodic reviews of reports and audit trails.

OS Administration and Management:

Install and configure operating systems for HPC clusters.

Address OS‑related issues such as CPU, memory, and swap utilization, and perform application file system cleanup.

Ensure application service continuity by performing pre‑ and post‑checks from both OS and application perspectives during planned and unplanned outages.

Applications and Tools:

Install HPC libraries and tools such as MPI and compilers.

Install and configure HPC applications, both commercial off‑the‑shelf (COTS) and open source, and manage packages using Spack.

Apply patches and upgrades to HPC applications.

Resolve issues related to HPC applications.

HPC Storage Management:

Administer and configure HPC storage systems.

Oversee the administration of HPC file systems.

Monitor and troubleshoot HPC storage systems.

Manage backup and tape library systems.

Key Responsibilities

Cluster Management:
Install, configure, and maintain compute nodes, GPUs (NVIDIA), high‑speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE).
Performance Tuning:
Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.
User Support:
Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices.
Software Management:
Manage workload managers and schedulers (Slurm, LSF, OpenPBS), software licensing (FlexLM), containers (Singularity), and compilers.
Infrastructure:
Administer high‑speed interconnects (InfiniBand), storage (Lustre, Ceph), and potentially cloud/hybrid solutions.
Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).
Automation:
Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.
Security & Policy:
Implement and enforce security policies, manage user access, and oversee lifecycle management.

Essential Skills & Qualifications

Technical Expertise:
Strong skills in Linux, Python, scripting (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.
The team will have knowledge of Gilead systems and AWS CI/CD pipelines.
HPC Domain Knowledge:
Experience with parallel file systems, workload management, and performance analysis tools.
Problem Solving:
Excellent analytical and debugging skills for complex distributed systems.
Communication:
Ability to explain complex technical issues to scientists and non‑technical stakeholders.

Experience

Hands‑on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads.

Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.
