Reliability & Observability Analyst II

Job in Dallas, Dallas County, Texas, 75215, USA
Listing for: Iris Energy
Full Time position
Listed on 2026-06-18
Job specializations:
  • IT/Tech
    SRE/Site Reliability, IT Support, Cybersecurity
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

Job Type: Full-Time
Location: Dallas / Fort Worth, TX
Department: Operations
Reporting to: Data Center Manager
Work Location Type: Onsite

IREN is a leading next‑generation data center business powering the future with 100% renewable energy. We build, own, and operate our data centers and take pride in being at the forefront of sustainable solutions for the ever‑evolving applications of high‑performance computing. We believe that human progress is invaluable, but that it should be pursued in the right way: responsibly, sustainably, and with a positive impact on the communities we operate in.

Position Summary

We are seeking an IOC Reliability & Observability Analyst II to support our 24/7 HPC Data Center Operations by performing advanced incident triage, improving alert quality and routing, and maintaining high‑quality operational telemetry and reporting. This role partners with engineering and operations teams to identify detection gaps, tune monitoring and dashboards, and implement small automations and enrichment to reduce operational toil and improve time‑to‑action.

Job Responsibilities
  • Perform advanced Level 2 incident analysis by reviewing incident data, system behavior, and operational signals across GPU clusters, networks, and facilities to identify recurring issues, improve triage accuracy, and support faster and more effective escalation.
  • Maintain IOC service health dashboards and operational metrics that reflect alert effectiveness and incident response performance (e.g., MTTD/MTTR) for day‑to‑day operations and leadership reporting.
  • Identify alerting and monitoring gaps, under‑monitored systems, and noisy or ineffective alerts; tune thresholds, routing, suppression, and enrichment within IOC tooling and partner with engineering teams for instrumentation changes.
  • Own operational alert quality outcomes by ensuring sustained reductions in false positives, missed detections, poor routing, and alert fatigue through IOC‑approved standards, validation, and continuous review.
  • Analyze GPU health and performance signals during incidents to support faster triage, improve escalation quality, and reduce customer impact in GPU‑based environments.
  • Validate and oversee automated detection and correlation outputs, ensuring alerts, anomalies, and insights are accurate, actionable, and aligned with operational reality.
  • Implement and maintain IOC‑level automation (alert routing rules, enrichment fields, ticket templates, runbook scripts) to standardize response and reduce manual toil during incidents.
  • Ensure ITSM incident and ticket records meet IOC quality standards by validating timelines, categorizations, ownership, and resolution notes; support RCA workflows with complete operational inputs.
  • Provide peer coaching and onboarding support to Analyst I team members on triage patterns, alert interpretation, and dashboard and runbook usage; contribute to operational documentation.
  • Support IOC shift operations through detailed incident handoffs, queue hygiene, and coordination with on‑call engineering and facilities teams during escalations.
Qualifications
  • 3–5 years of experience in IOC/NOC/SRE‑adjacent operations, reliability engineering, observability, or production support roles within 24/7 production environments.
  • Bachelor’s degree in Computer Science, Data Science, IT, or equivalent hands‑on professional experience.
  • Demonstrated ability to apply reliability engineering principles (incident lifecycle, MTTD/MTTR, operational risk) to improve detection, response effectiveness, and overall service stability.
  • Strong working knowledge of Linux systems, basic networking, and infrastructure dependencies across compute, network, and facility domains.
  • Practical experience supporting GPU‑based compute environments or high‑density clusters, including analysis of GPU health, performance degradation, and failure patterns.
  • Proven experience owning and improving alert quality, reducing false positives, missed detections, poor routing, and alert fatigue in complex environments.
  • Hands‑on experience maintaining service health dashboards and operational reliability metrics, including SLI/SLO reporting where defined.
  • Ability to…