IOC Systems Specialist Job, Fort Worth area, Texas, USA, IT/Tech

Onsite | M-F 8 hr shifts (rotating on call)

Optomi, in partnership with a leading AI cloud infrastructure organization, is seeking an IOC Systems Specialist to join their growing operations team in Fort Worth, TX. This role will provide Tier 2 operational support for high-performance computing (HPC) cloud environments focused on large-scale AI training and inference workloads. The ideal candidate will have hands‑on experience supporting HPC infrastructure, Kubernetes environments, Slurm workload management, and enterprise storage platforms such as WEKA and VAST.

This individual will play a key role in maintaining system stability, troubleshooting complex incidents, and supporting mission‑critical infrastructure within a 24x7 IOC/NOC environment.

What the Right Candidate Will Enjoy:

Working with cutting-edge AI and HPC infrastructure technologies!
Exposure to advanced Kubernetes, cloud, and storage technologies!
Opportunities to contribute to operational improvements and automation initiatives!
Joining a fast‑growing organization focused on sustainable, renewable‑powered AI infrastructure!
Collaborative environment with strong technical leadership and growth opportunities!

What Type of Experience the Right Candidate Has:

2–5 years of experience supporting or operating HPC clusters in production environments
Strong operational experience with WEKA and VAST storage platforms
Hands‑on experience with Kubernetes administration and troubleshooting
Experience supporting Slurm workload manager environments
Familiarity with HPC monitoring, observability, and alerting platforms
Experience performing incident response and root cause analysis in complex systems
Understanding of cloud platforms such as AWS, Azure, or GCP
Knowledge of HPC networking and storage technologies, including InfiniBand and high‑throughput interconnects

Responsibilities of the Right Candidate:

Provide Tier 2 operational support for HPC cloud infrastructure environments
Monitor, troubleshoot, and resolve incidents involving Kubernetes, Slurm, storage, networking, and cloud systems
Serve as an escalation point for Tier 1 support teams
Perform root cause analysis and coordinate with engineering teams on permanent resolutions
Execute operational changes, upgrades, patching, and maintenance activities
Maintain and improve operational documentation, runbooks, and knowledge base articles
Support monitoring and observability tooling to proactively identify system issues
Assist with operational readiness and production support for new HPC capabilities
Mentor junior operations staff and support continuous service improvement initiatives
Participate in on‑call rotations and major incident response activities

Job Must Haves:

Must have hands‑on experience with WEKA and VAST storage environments
2–5 years supporting HPC clusters in production or IOC/NOC environments
Working knowledge of Kubernetes
Operational experience with Slurm workload manager
Familiarity with HPC monitoring and observability tooling
Experience with incident response and root cause analysis
Understanding of AWS, Azure, or GCP cloud platforms
Knowledge of HPC networking and storage infrastructure
Ability to work onsite in Fort Worth on a rotating 12‑hour shift schedule

Nice to Have Skills:

Relevant certifications such as CKA/CKAD, RHCSA, Linux+, ITIL, or Server+
Experience with GPU or HPC vendor technologies
Experience supporting AI or large‑scale compute environments
