IOC Systems Analyst
Job in
Fort Worth, Tarrant County, Texas, 76102, USA
Listed on 2026-06-12
Listing for:
Optomi
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Job Description & How to Apply Below
Optomi, in partnership with a leading AI Cloud Service Provider, is seeking an IOC Systems Specialist to join a fast-paced operations team supporting large-scale HPC and GPU cloud environments.
Position Summary
The IOC Systems Specialist is responsible for providing Tier 2 operational support for high-performance computing (HPC) cloud infrastructure in a 24x7 IOC/NOC environment. This role focuses on monitoring, troubleshooting, and resolving complex incidents across Kubernetes clusters, Slurm-managed workloads, cloud services, and large-scale storage environments. The specialist will help ensure system stability, performance, and uptime while supporting mission-critical AI and GPU computing operations.
What the right candidate will enjoy
- Working with cutting-edge AI and HPC infrastructure technologies
- Supporting large-scale GPU cloud environments in a highly technical operations setting
- Collaborating with engineering and infrastructure teams to solve complex production issues
- Opportunities to grow within cloud, Kubernetes, HPC, and observability technologies
- Being part of a sustainability-focused organization powered by renewable energy
Experience of the right candidate
- 2–5 years of experience supporting or operating HPC clusters in a production IOC/NOC environment
- Hands‑on experience with Kubernetes and Slurm workload manager
- Experience supporting storage technologies such as WEKA and VAST
- Background in incident response, troubleshooting, and root cause analysis within complex systems
- Familiarity with cloud platforms such as AWS, Azure, or GCP
- Understanding of HPC networking and storage infrastructure, including InfiniBand, Ethernet fabrics, and high‑throughput storage environments
- Post‑secondary education in Computer Science, Engineering, or related technical discipline, or equivalent hands‑on experience
Responsibilities of the right candidate
- Provide Tier 2 operational support for HPC cloud environments while maintaining system stability and SLA adherence
- Monitor, troubleshoot, and resolve incidents related to Kubernetes, Slurm, storage systems, and associated cloud infrastructure
- Act as an escalation point for Tier 1 support teams and coordinate with engineering teams for permanent resolution of issues
- Perform root cause analysis and contribute to continuous operational improvements
- Execute operational changes, maintenance activities, patching, and upgrades following change management procedures
- Support and maintain monitoring, alerting, and observability tools for proactive issue detection
- Maintain runbooks, operational documentation, incident reports, and knowledge base articles
- Support operational readiness for new HPC technologies and infrastructure deployments
- Provide guidance and mentorship to Tier 1 operations staff and data center technicians
- Participate in a 24x7 rotating shift schedule and major incident response activities