HPC Observability Engineer
Job in Hialeah, Miami-Dade County, Florida, 33002, USA
Listed on 2026-06-04
Listing for: EIT Professionals Corp
Seasonal/Temporary position
Job specializations:
- IT/Tech: Systems Engineer, Data Engineer
Job Description & How to Apply Below
Role: HPC Observability Engineer (Python, HPC)
Location: Remote
Contract
Description: The client has Grafana and InfluxDB services running on K8S in-house, on-premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform. The HPC Observability Engineer should have experience in:
- Setting up and maintaining Grafana dashboards for HPC environments
- Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
- Exploring and utilizing out-of-the-box metrics from InfluxDB
- Writing Python scripts for data ingestion into InfluxDB with examples
- Developing a proof of concept with a simple Python script to monitor load
- Ingesting InfiniBand packet data
- Monitoring LSF jobs in various states
- Visualizing server-specific and cluster-wide metrics in Grafana
- Optional: Integrating third-party plugins like DDN’s Lustre, Mellanox fabric, etc.
- B.Tech, MS, or PhD in Computer Science or related field
- 5-8 years of experience with Grafana, InfluxDB, and Telegraf
- Experience in Python and Bash scripting is a plus
- Knowledge of Docker and Google Cloud Platform is advantageous
- HPC operations experience is beneficial
- Strong communication skills and ability to work independently
- Proficiency in requirements analysis and automated testing
- Ability to write efficient, secure, and well-documented Python code
- Experience with Git and pipeline development
- Awareness of modern security and development practices
- Develop and leverage Grafana dashboards and Telegraf configurations
- Create dashboards for server and cluster metrics
- Develop Python scripts for data ingestion and documentation
- Visualize non-native resources in Grafana
- Optional: Integrate third-party plugins
- Maintain high-quality code and documentation
- Collaborate with teams to troubleshoot and optimize pipelines
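For illustration only: the "simple Python script to monitor load" mentioned in the description could look roughly like the sketch below. It is not the client's implementation. It assumes a Unix host, a hypothetical measurement name `poc_load`, and that the output would be picked up by something that understands InfluxDB line protocol (for example, a Telegraf exec input configured for the `influx` data format). The script only formats and prints one point; it does not write to InfluxDB itself.

```python
import os
import socket
import time


def escape_tag(value: str) -> str:
    """Escape commas, equals signs, and spaces in a line-protocol tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def load_to_line(host: str, load1: float, load5: float, load15: float, ts_ns: int) -> str:
    """Format load averages as one InfluxDB line-protocol point.

    'poc_load' is a hypothetical measurement name chosen for this sketch.
    """
    return (
        f"poc_load,host={escape_tag(host)} "
        f"load1={load1:.2f},load5={load5:.2f},load15={load15:.2f} {ts_ns}"
    )


def main() -> None:
    # os.getloadavg() is available on Unix-like systems only.
    load1, load5, load15 = os.getloadavg()
    # Line protocol timestamps default to nanosecond precision.
    print(load_to_line(socket.gethostname(), load1, load5, load15, time.time_ns()))


if __name__ == "__main__":
    main()
```

Printing to standard output keeps the sketch free of any client-library dependency; a real ingestion script would likely add error handling and a proper write path.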
Skills:
- Python (good to have)
- Bash scripting (good to have)
- Docker (must)
- HPC operations and LSF (good to have)
- Experience with DDN Lustre, Mellanox fabric (good to have)
- Google Cloud Platform (good to have)
- Knowledge of Git (must)
- Mid-Senior level
- Contract
- Engineering and Information Technology
- IT Services and IT Consulting
This job is active and accepting applications.