HPC Observability Engineer

Job in Hialeah, Miami-Dade County, Florida, 33002, USA
Listing for: EIT Professionals Corp
Seasonal/Temporary position
Listed on 2026-06-04
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

Role: HPC Observability Engineer (Python, HPC)

Location: Remote

Contract

Description:

The client has Grafana and InfluxDB services running on K8s in-house, on-premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform. The HPC Observability Engineer should have experience in:

  • Setting up and maintaining Grafana dashboards for HPC environments
  • Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
  • Exploring and utilizing out-of-the-box metrics from InfluxDB
  • Writing Python scripts for data ingestion into InfluxDB, with examples
  • Developing a proof of concept with a simple Python script to monitor load
  • Ingesting InfiniBand packet data
  • Monitoring LSF jobs in various states
  • Visualizing server-specific and cluster-wide metrics in Grafana
  • Optional: Integrating third-party plugins such as DDN’s Lustre and Mellanox fabric
Qualifications and Skills:
  • B.Tech, MS, or PhD in Computer Science or related field
  • 5-8 years of experience with Grafana, InfluxDB, and Telegraf
  • Experience in Python and Bash scripting is a plus
  • Knowledge of Docker and Google Cloud Platform is advantageous
  • HPC operations experience is beneficial
  • Strong communication skills and ability to work independently
  • Proficiency in requirements analysis and automated testing
  • Ability to write efficient, secure, and well-documented Python code
  • Experience with Git and pipeline development
  • Awareness of modern security and development practices
Responsibilities:
  • Develop and leverage Grafana dashboards and Telegraf configurations
  • Create dashboards for server and cluster metrics
  • Develop Python scripts for data ingestion and documentation
  • Visualize non-native resources in Grafana
  • Optional: Integrate third-party plugins
  • Maintain high-quality code and documentation
  • Collaborate with teams to troubleshoot and optimize pipelines
Desired Skills:
  • Python (good to have)
  • Bash scripting (good to have)
  • Docker (must)
  • HPC operations and LSF (good to have)
  • Experience with DDN Lustre, Mellanox fabric (good to have)
  • Google Cloud Platform (good to have)
  • Knowledge of Git (must)
Seniority level:
  • Mid-Senior level
Employment type:
  • Contract
Job function:
  • Engineering and Information Technology
Industries:
  • IT Services and IT Consulting

This job is active and accepting applications.
