Operations Engineer Job, Las Vegas area, Nevada, USA, IT/Tech

About Tensor Wave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

About the Role

We are looking for an Operations Engineer to join our Global Operations Center team as the frontline of Tensor Wave's customer infrastructure reliability. This role is focused on monitoring customer environment health, detecting issues before they impact workloads, and serving as the L1 response to customer-reported problems. It is based at Tensor Wave headquarters in Las Vegas and is part of our customer-facing 24/7 team.

You'll be responsible for monitoring systems, executing runbooks, and coordinating with on-site teams and engineering when escalation is needed.

Tensor Wave is building its Operations Center from the ground up, and early team members will have a direct impact on how we keep our customers' most critical workloads running. This role is ideal for someone who is sharp under pressure, naturally detail-oriented, and motivated by the knowledge that their work directly protects customer outcomes.

What You'll Do

* Monitor customer environments in real time across Tensor Wave data centers using monitoring and observability platforms

* Track key health indicators including GPU utilization, node availability, network performance, storage health, and Kubernetes cluster status

* Identify anomalies, degradations, and emerging issues before they escalate into customer-impacting events

* Maintain situational awareness of active customer workloads, scheduled maintenance windows, and known issues across the fleet

* Provide regular health summaries and flag trends that may indicate systemic risks to customer environments

* Serve as the first responder to customer-reported issues and system-generated alerts, performing initial triage and classification

* Execute established runbooks to diagnose and resolve common infrastructure issues including node failures, connectivity problems, and resource contention

* Escalate issues to L2 engineering or on-site data center teams with clear, actionable context

* Maintain accurate incident records including timeline, actions taken, and resolution details in the ticketing system

* Communicate status updates to internal stakeholders during active incidents, ensuring visibility across operations and customer-facing teams

* Follow and contribute to operational runbooks and standard operating procedures, identifying gaps or improvements based on real-world incidents

* Assist with tuning monitoring and alerting by providing feedback on alert quality, false-positive rates, and coverage gaps

* Document tribal knowledge, recurring issue patterns, and lessons learned to strengthen the team's operational knowledge base

* Participate in post-incident reviews, contributing observations from the frontline monitoring and response perspective

* Support change management processes by monitoring customer environments during planned maintenance and infrastructure changes

* Coordinate with on-site data center operations teams for hands-on remediation activities that require physical access

Who You Are

Required Qualifications

* 1-3 years of experience in a NOC, operations center, technical support, systems administration, or similar infrastructure operations role

* Experience monitoring production infrastructure using observability tools (Grafana, Datadog, Prometheus, or similar)

* Foundational Linux systems administration skills with the ability to navigate systems, read logs, and execute diagnostic commands

* Basic understanding of networking fundamentals including TCP/IP, DNS, and VLANs

* Experience following operational runbooks and structured triage procedures in a production environment

* Strong written communication skills, particularly the ability to write clear incident updates and escalation summaries under time pressure

* Demonstrated ability to stay calm, prioritize effectively, and work methodically during high-pressure situations

*…

