Senior Cloud Engineer

Job in Foster City, San Mateo County, California, 94420, USA
Listing for: Diverse Lynx
Full Time position
Listed on 2026-02-13
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, Data Engineer, Systems Administrator
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD per year
Job Description & How to Apply Below

Top Skills Required for this role:

  • HPC – High performance computing
  • AWS cloud services
  • DevOps CI/CD
  • Python
Scope of Work

HPC Cluster Deployment:

Automate the deployment of HPC clusters through CI/CD pipelines, using GitHub pipelines and AWS Systems Manager.

Implement CI/CD pipelines to manage and deploy updates to the HPC cluster efficiently.

Set up and configure HPC clusters to meet specific requirements and workloads.

Manage and maintain HPC hardware components such as CPUs and GPUs, along with the necessary software.

Conduct regression testing to verify the functionality and performance of non‑GxP HPC clusters.

Workload Scheduler Management:

Install and configure workload managers and schedulers like LSF, SLURM, and PBS Pro.

Manage the addition and removal of compute nodes and adjust the priority of master and slave nodes.

Develop and manage resource policies and rules to optimize cluster performance.

Configure and allocate resources such as CPU and memory, and profile applications for optimal performance.

Address and resolve issues related to schedulers, daemons, and license servers.

Network and High‑Performance Connectivity Management:

Install and configure HPC interconnect networks.

Design and configure the network topology for HPC clusters.

Maintain and monitor InfiniBand connectivity.

Resolve connectivity issues related to InfiniBand, RoCE, and Ethernet.

Monitoring and Reports:

Produce daily health check reports for the HPC cluster.

Automate monitoring scripts to streamline the monitoring process.

Conduct periodic reviews of reports and audit trails.

OS Administration and Management:

Install and configure operating systems for HPC clusters.

Address OS‑related issues such as CPU, memory, and swap utilization, and perform application file system cleanup.

Ensure application service continuity by performing pre‑ and post‑checks at both the OS and application level during planned and unplanned outages.

Applications and Tools:

Install HPC libraries and tools such as MPI and compilers.

Install and configure HPC applications, both commercial off‑the‑shelf (COTS) and open source, and manage packages using Spack.

Apply patches and upgrades to HPC applications.

Resolve issues related to HPC applications.

HPC Storage Management:

Administer and configure HPC storage systems.

Oversee the administration of HPC file systems.

Monitor and troubleshoot HPC storage systems.

Manage backup and tape library systems.

Key Responsibilities
  • Cluster Management:
    Install, configure, and maintain compute nodes, GPUs (NVIDIA), high‑speed storage (Lustre, GPFS), and interconnects (InfiniBand, RoCE).
  • Performance Tuning:
    Optimize scientific applications, kernels, and workflows for maximum throughput, scalability, and minimal queue times.
  • User Support:
    Act as a technical expert for researchers, debugging jobs, resolving complex issues, and providing training on tools and best practices.
  • Software Management:
    Manage workload managers and schedulers (Slurm, LSF, OpenPBS), software licensing (FlexLM), containers (Singularity), and compilers.
  • Infrastructure:
    Administer high‑speed interconnects (InfiniBand), storage (Lustre, Ceph), and potentially cloud/hybrid solutions.
  • Monitoring & Orchestration:
    Implement and manage monitoring (Grafana, Prometheus) and orchestration tools (Slurm, Kubernetes).
  • Automation:
    Develop scripts (Python, Ansible) for provisioning, monitoring, and automating routine tasks.
  • Security & Policy:
    Implement and enforce security policies, manage user access, and oversee lifecycle management.
Essential Skills & Qualifications
  • Technical Expertise:
    Strong skills in Linux, Python, automation tooling (Ansible, Terraform), HPC schedulers (Slurm), networking (InfiniBand), and GPU computing.
  • Team will have knowledge of Gilead systems and AWS CI/CD pipelines.
  • HPC Domain Knowledge:
    Experience with parallel file systems, workload management, and performance analysis tools.
  • Problem Solving:
    Excellent analytical and debugging skills for complex distributed systems.
  • Communication:
    Ability to explain complex technical issues to scientists and non‑technical stakeholders.
Experience

Hands‑on experience in data centers, managing large clusters, and supporting diverse scientific/AI workloads.

Diverse Lynx LLC is an Equal Employment Opportunity employer. All qualified applicants will receive due consideration for employment without any discrimination. All applicants will be evaluated solely on the basis of their ability, competence and their proven capability to perform the functions outlined in the corresponding role. We promote and support a diverse workforce across all levels in the company.

Position Requirements
10+ years of work experience