IOC Systems Analyst Job, Fort Worth area, Texas, USA, IT/Tech

Optomi, in partnership with a leading AI Cloud Service Provider, is seeking an IOC Systems Specialist to join a fast-paced operations team supporting large-scale HPC and GPU cloud environments.

Position Summary

The IOC Systems Specialist is responsible for providing Tier 2 operational support for high-performance computing (HPC) cloud infrastructure in a 24x7 IOC/NOC environment. This role focuses on monitoring, troubleshooting, and resolving complex incidents across Kubernetes clusters, Slurm-managed workloads, cloud services, and large-scale storage environments. The specialist will help ensure system stability, performance, and uptime while supporting mission-critical AI and GPU computing operations.

What the right candidate will enjoy

Working with cutting-edge AI and HPC infrastructure technologies
Supporting large-scale GPU cloud environments in a highly technical operations setting
Collaborating with engineering and infrastructure teams to solve complex production issues
Opportunities to grow within cloud, Kubernetes, HPC, and observability technologies
Being part of a sustainability-focused organization powered by renewable energy

What type of experience the right candidate has

2–5 years of experience supporting or operating HPC clusters in a production IOC/NOC environment
Hands‑on experience with Kubernetes and Slurm workload manager
Experience supporting storage technologies such as WEKA and VAST
Background in incident response, troubleshooting, and root cause analysis within complex systems
Familiarity with cloud platforms such as AWS, Azure, or GCP
Understanding of HPC networking and storage infrastructure, including InfiniBand, Ethernet fabrics, and high‑throughput storage environments
Post‑secondary education in Computer Science, Engineering, or related technical discipline, or equivalent hands‑on experience

What the responsibilities of the right candidate are

Provide Tier 2 operational support for HPC cloud environments while maintaining system stability and SLA adherence
Monitor, troubleshoot, and resolve incidents related to Kubernetes, Slurm, storage systems, and associated cloud infrastructure
Act as an escalation point for Tier 1 support teams and coordinate with engineering teams for permanent resolution of issues
Perform root cause analysis and contribute to continuous operational improvements
Execute operational changes, maintenance activities, patching, and upgrades following change management procedures
Support and maintain monitoring, alerting, and observability tools for proactive issue detection
Maintain runbooks, operational documentation, incident reports, and knowledge base articles
Support operational readiness for new HPC technologies and infrastructure deployments
Provide guidance and mentorship to Tier 1 operations staff and data center technicians
Participate in a 24x7 rotating shift schedule and major incident response activities
