IOC Systems Specialist
Listed on 2026-06-12
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, IT Infrastructure, IT Support
Onsite | M-F 8 hr shifts (rotating on call)
Optomi, in partnership with a leading AI cloud infrastructure organization, is seeking an IOC Systems Specialist to join their growing operations team in Fort Worth, TX. This role will provide Tier 2 operational support for high-performance computing (HPC) cloud environments focused on large-scale AI training and inference workloads. The ideal candidate will have hands‑on experience supporting HPC infrastructure, Kubernetes environments, Slurm workload management, and enterprise storage platforms such as WEKA and VAST.
This individual will play a key role in maintaining system stability, troubleshooting complex incidents, and supporting mission‑critical infrastructure within a 24x7 IOC/NOC environment.
- Working with cutting-edge AI and HPC infrastructure technologies!
- Exposure to advanced Kubernetes, cloud, and storage technologies!
- Opportunities to contribute to operational improvements and automation initiatives!
- Joining a fast‑growing organization focused on sustainable, renewable‑powered AI infrastructure!
- Collaborative environment with strong technical leadership and growth opportunities!
- 2–5 years of experience supporting or operating HPC clusters in production environments
- Strong operational experience with WEKA and VAST storage platforms
- Hands‑on experience with Kubernetes administration and troubleshooting
- Experience supporting Slurm workload manager environments
- Familiarity with HPC monitoring, observability, and alerting platforms
- Experience performing incident response and root cause analysis in complex systems
- Understanding of cloud platforms such as AWS, Azure, or GCP
- Knowledge of HPC networking and storage technologies, including InfiniBand and high‑throughput interconnects
- Provide Tier 2 operational support for HPC cloud infrastructure environments
- Monitor, troubleshoot, and resolve incidents involving Kubernetes, Slurm, storage, networking, and cloud systems
- Serve as an escalation point for Tier 1 support teams
- Perform root cause analysis and coordinate with engineering teams on permanent resolutions
- Execute operational changes, upgrades, patching, and maintenance activities
- Maintain and improve operational documentation, runbooks, and knowledge base articles
- Support monitoring and observability tooling to proactively identify system issues
- Assist with operational readiness and production support for new HPC capabilities
- Mentor junior operations staff and support continuous service improvement initiatives
- Participate in on‑call rotations and major incident response activities
- Must have hands‑on experience with WEKA and VAST storage environments
- 2–5 years supporting HPC clusters in production or IOC/NOC environments
- Working knowledge of Kubernetes
- Operational experience with Slurm workload manager
- Familiarity with HPC monitoring and observability tooling
- Experience with incident response and root cause analysis
- Understanding of AWS, Azure, or GCP cloud platforms
- Knowledge of HPC networking and storage infrastructure
- Ability to work onsite in Fort Worth on a rotating 12‑hour shift schedule
Skills:
- Relevant certifications such as CKA/CKAD, RHCSA, Linux+, ITIL, or Server+
- Experience with GPU or HPC vendor technologies
- Experience supporting AI or large‑scale compute environments