Principal Site Reliability Engineer Job Aurora area, Illinois USA, IT/Tech

Overview

We are seeking a Principal Site Reliability Engineer with deep systems-level expertise and a proven track record of resolving complex, high-stakes incidents in production healthcare environments. This senior technical role sits at the intersection of platform operations, customer support, and clinical interoperability — requiring someone capable of working with live Linux hosts and containers to trace issues, reconstructing DICOM workflows from raw network captures and logs, and guiding a hospital IT team through connectivity and data issues over a bridge call.

The Principal Site Reliability Engineer operates at the same technical depth as senior members of the engineering team, serving as the highest escalation point for production incidents and the primary technical interface for complex customer deployments. The successful candidate combines the investigative instincts of a site reliability engineer with the communication clarity of a solutions architect and the domain fluency of a healthcare IT professional.

Key Responsibilities

Production Incident Response & Troubleshooting
- Serve as the senior escalation point for critical production incidents across cloud-hosted and customer-premise deployments, owning issues from initial triage through root cause identification and resolution.
- Perform advanced live-system diagnostics on Linux hosts and Docker/ECS container environments, including log aggregation, process inspection, resource contention analysis, and crash dump review.
- Analyze Laravel/PHP, C#, and Rust application behavior in production: parsing structured and unstructured application logs, tracing exception stack traces, diagnosing session and cache failures (Redis/ElastiCache), and identifying OOM conditions, deadlocks, or misconfigured queue workers.
- Investigate and resolve AWS infrastructure-layer issues spanning EC2 instances, ECS task and service health, SQS message backlog and DLQ accumulation, SNS delivery failures, S3 access and policy errors, Aurora connection pool exhaustion, and CloudWatch alarm and metric anomalies.
- Conduct in-depth container-level debugging: inspecting Docker layer builds, ECS task definition misconfigurations, networking between tasks via Cloud Map or ALB target groups, and environment variable or secrets injection failures from Secrets Manager.
- Use CloudWatch Logs Insights, Metrics, and X-Ray (or equivalent) to correlate distributed system failures across service boundaries and identify latency outliers, error rate spikes, and cascading failure patterns.

DICOM & HL7 Troubleshooting
- Diagnose and resolve failures in DICOM workflows including C-STORE, C-FIND, C-MOVE, and C-GET operations and DICOMweb (WADO-RS, STOW-RS, QIDO-RS) endpoints, using tools such as the DCMTK utilities (dcmdump, storescu, findscu) and Wireshark/tcpdump packet captures.
- Troubleshoot DICOM association negotiation failures, transfer syntax mismatches, SOP class rejection, and modality connectivity issues across multi-site PACS and VNA deployments.
- Analyze HL7 v2 message flows (ADT, ORM, ORU, MDM) through integration engines and custom adapters, identifying parsing errors, field mapping failures, segment ordering issues, and acknowledgment (ACK/NACK) problems.
- Collaborate with clinical informatics and integration teams at customer sites to resolve interoperability issues between modalities, RIS, EHR, and the platform's imaging exchange infrastructure.

Multi-Site & Customer Network Troubleshooting
- Lead technical engagement for complex multi-site deployment issues spanning customer on-premise networks.
- Diagnose network-layer issues affecting DICOM connectivity: port accessibility, firewall rule conflicts, MTU mismatch, TLS certificate errors, and proxy interference with DICOMweb or HL7 MLLP traffic.
- Engage directly with customer IT and networking teams to coordinate resolution of infrastructure-side issues, translating complex platform requirements into actionable guidance for non-specialist audiences.
- Document multi-site deployment architectures, network topology dependencies, and known issue patterns in Confluence to build institutional knowledge and accelerate future incident resolution.

Operational Excellence & Team Enablement
- Author detailed post-incident reports (PIRs) with timeline reconstruction, root cause analysis, contributing factors, and corrective action items, distributing findings to engineering, product, and customer stakeholders.
- Build and maintain runbooks, diagnostic playbooks, and escalation decision trees in Confluence for common failure categories, enabling support and customer success teams to handle a larger share of incidents independently.
- Partner with engineering teams to surface systemic issues discovered through support patterns, advocating for observability improvements, defensive coding practices, and configuration guardrails.
- Define and track support SLA metrics including MTTR, escalation rate, and repeat incident frequency, reporting trends to leadership and recommending operational…