Director, Site Reliability Engineering & Cloud Operations; SRE Job Golden Valley area, Minnesota USA, IT/Tech

Position: Director, Site Reliability Engineering & Cloud Operations (SRE)

Job Description

At Resideo, we imagine a world where homes and buildings are good for the planet, and where technology works to simplify everyday life. In that world, people are healthy, happy, and secure. To help create this future, we will work every day to simplify the connected world so people have peace of mind and can focus on what matters most. Resideo is making a large investment in our engineering group.

With global reach and impact, we are dedicated to building our team as we develop new products and introduce them to consumers around the world (new product introduction, or NPI). As an established leader in the connected products space, we will give you a platform to work on new and innovative projects as a member of a team of intelligent innovators who are developing products that truly align with our mission of protecting what matters most.

This is an exciting opportunity to lead cloud operations for one of the largest IoT ecosystems in the world, shaping the future of cloud infrastructure, SRE, and AI‑driven operations. You'll work alongside world‑class engineering talent and cutting‑edge technologies to advance Resideo’s mission of simplifying everyday life through innovative connected products. As a leader, you will drive the platform engineering transformation across multiple teams in a global organization, delivering on business priorities while collaborating with development leaders and executives to define and advance best practices.

Job Duties

Cloud Infrastructure & SRE Strategy
- Define and execute global cloud operations and SRE strategies, ensuring 99.999%+ uptime for mission‑critical IoT services.
- Architect, implement, and optimize multi‑cloud infrastructure to support IoT devices with low‑latency data processing, scalability, and high availability.
- Drive cost optimization strategies while balancing performance, redundancy, and financial efficiency across cloud platforms (Azure).
- Develop automated deployment, monitoring, and recovery systems using technologies such as Kubernetes, Terraform, Ansible, and CI/CD pipelines.

Reliability, Performance & Incident Management
- Establish and refine SLOs, SLIs, and KPIs for service reliability, performance, and capacity planning.
- Build and optimize incident management, disaster recovery, and resilience engineering frameworks.
- Leverage AI/ML‑driven automation for proactive failure detection and remediation.

Security & Compliance
- Implement robust cloud security practices, ensure compliance with standards such as SOC 2, GDPR, and NIST, and oversee the zero‑trust security model for IoT data protection.
- Collaborate with security and compliance teams to manage risk and ensure regulatory adherence across cloud platforms.

Team Leadership & Cross‑Functional Collaboration
- Lead and mentor a global team of Cloud Engineers, SREs, and software professionals, fostering a culture of continuous learning and innovation.
- Partner with product management, software engineering, and customer support to optimize IoT device onboarding, firmware updates, and cloud‑to‑edge performance.
- Collaborate with finance and executive leadership to develop long‑term cloud investment strategies.

Required Qualifications

15+ years in Computer Science, Electrical Engineering, or a related field
15+ years of experience in Cloud Operations, SRE, or Infrastructure Engineering, with 8+ years in technical leadership roles
5+ years of experience managing large‑scale, distributed IoT cloud environments supporting billions of data points per day
5+ years of deep professional experience in Azure cloud platforms including networking, storage, compute, and database services
5+ years of experience with Kubernetes, Terraform, CI/CD pipelines, and observability tools (e.g., Prometheus, Grafana, ELK)
5+ years of experience in large‑scale systems design and architecture, with a focus on reliability, performance, and scalability of cloud‑native platforms
5+ years of hands‑on experience with Infrastructure‑as‑Code (IaC) tools such as Terraform, Ansible, CDK, or Pulumi, and with managing cloud‑native architectures

What We Value

Strong background in AI/ML‑driven automation for cloud infrastructure monitoring,…
