Major Incident Manager
Listed on 2026-03-01
IT/Tech
IT Support, Technical Support, Cybersecurity, IT Project Manager
The Incident Manager role is critical to maintaining service reliability and preserving customer trust. This position directly impacts company success by minimizing downtime, managing high-severity incidents, and ensuring rapid resolution of complex technical challenges. You will lead the response to high-visibility incidents and customer escalations, acting as a central point of coordination to drive timely, effective outcomes.
In this role, you’ll spearhead the management of critical incidents from identification through resolution, while continuously improving incident response processes and support readiness. You’ll work cross-functionally with engineering, product, and customer teams to design scalable self-service support workflows, contribute to product improvements, and develop robust incident response strategies. You’ll also play a key role in mentoring team members, delivering training, and building knowledge resources that strengthen both internal teams and customer success.
We’re looking for a technically skilled professional with strong Linux expertise, excellent communication skills, and 4–5 years of customer-facing experience. Prior experience in incident management and on-call rotations is essential.
What You’ll Be Working On
- Diagnose and resolve complex technical issues related to InfiniBand, containerization, and distributed training environments
- Lead high-severity incident response efforts to ensure rapid mitigation and minimal disruption to customer operations
- Manage customer escalations with professionalism, clarity, and urgency, ensuring stakeholder confidence throughout the incident lifecycle
- Guide customers through the implementation, configuration, and optimization of HPC infrastructure
- Partner with customers to improve performance, scalability, and efficiency across their environments
- Develop and deliver internal and external training materials, including live training sessions, documentation, and knowledge base articles
- Provide ongoing enablement to help customers effectively adopt and maximize the value of company solutions
- Lead incident response training and preparedness initiatives for internal teams
- Work closely with engineering and product teams to share customer feedback and operational insights
- Influence product enhancements and reliability improvements based on real-world incident data
- Contribute to the continuous improvement of incident management processes and the overall customer experience
- Strong hands-on experience with Linux, virtualization, Kubernetes, and managing customer incidents
- Solid understanding of the TCP/IP stack
- Working knowledge of Infrastructure-as-Code (IaC) practices
- Excellent written and verbal communication skills, with the ability to clearly explain complex technical issues
- Proven problem-solving mindset with strong diagnostic and analytical abilities
- 3–5+ years of experience in a team leadership role, serving as a liaison between internal teams and external customers
- 4–5 years of customer-facing experience in a technical environment
- Direct experience participating in or leading incident management efforts and on-call rotations
- Programming experience in one or more programming languages
- Restricted Stock Units (RSUs) in a fast-growing, well-funded technology company
- Comprehensive health insurance options, including HDHP and PPO plans, plus vision and dental coverage for you and your dependents
- Employer contributions to HSA accounts
- Paid parental leave
- Company-paid life insurance, short-term disability, and long-term disability coverage
- 401(k) plan with a 100% company match up to 4% of salary
- Generous paid time off and holiday schedule
- Cell phone reimbursement
- Subscription to the Calm app
- MetLife Legal benefits
- Company-paid Commuter FSA benefit of $200 per month