Senior Cloud Operations Analyst Job, Farmington Hills area, Michigan, USA, IT/Tech

The Senior Cloud Operations Analyst is responsible for leading the management, optimization, and automation of cloud and on‑premises infrastructure to ensure seamless operations and business continuity. This role includes driving improvements in observability, server and batch operations, and data center management while proactively identifying and resolving performance and reliability issues. The Senior Cloud Operations Analyst provides technical leadership, mentors team members, and consults with cross‑functional teams to enhance operational excellence through best practices, process enhancements, and cutting‑edge technologies.

Independently develop, implement, and maintain observability tools to monitor cloud and on‑premises systems.
Actively support infrastructure teams in the management and maintenance of server systems running on Windows and Linux.
Create dashboards, alerts, and reports to track system health, performance, and availability.
Analyze metrics and logs to identify trends, prevent potential issues, and optimize system performance.
Act as the lead consultant with FinOps teams to monitor resource utilization and ensure cost‑effective operations across cloud environments.
Manage the lifecycle of cloud and on‑premises servers, including provisioning, patching, configuration, and decommissioning.
Troubleshoot and resolve server‑related issues, ensuring minimal downtime.
Implement and enforce server security policies and compliance requirements.
Schedule, monitor, and manage batch processes to ensure timely execution of critical tasks.
Identify and resolve batch failures or delays, coordinating with relevant teams to ensure smooth operations.
Build new batch jobs for improved performance and resource utilization.
Lead on‑site and remote data center operations, ensuring proper functioning of hardware, power, cooling, and network infrastructure.
Coordinate with vendors and service providers for hardware maintenance, replacements, and upgrades.
Participate in on‑call rotations to address system incidents and outages promptly.
Conduct root cause analysis and implement solutions to prevent recurrence of issues.
Document and communicate incident resolution processes to relevant stakeholders.
Work closely with cross‑functional teams, including DevOps, Networking, and Application Development, to implement and maintain system integrations.
Maintain comprehensive documentation for configurations, processes, and incident resolutions.
Provide training and support to team members and other departments.

Nonessential Tasks/Marginal Duties:

Knowledge, Skills & Abilities:

Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
5+ years of experience working with monitoring and observability tools (e.g., Datadog, PagerDuty).
Certified PagerDuty Administrator or equivalent experience required.
5+ years of experience in cloud operations or server management roles.
5+ years of progressive server administration experience (Windows, Linux).
5+ years of experience in designing, implementing, and managing IT workload automation solutions to optimize scheduling, orchestration, and execution of enterprise workflows across on‑prem and cloud environments.
Experience leveraging artificial intelligence to drive innovation and solve complex problems. Demonstrated ability to utilize AI‑driven solutions that optimize processes, enhance decision‑making, or create transformative business outcomes.
Demonstrated experience working with Infrastructure as Code (Terraform, CloudFormation, and Ansible).
5+ years working with cloud platforms (AWS, Azure, OCI).
Certified AWS SysOps Administrator or equivalent experience required.
Strong experience with data center infrastructure and best practices.
Proficiency in scripting and automation tools (Python, Bash, PowerShell).
Strong understanding of networking, security, and identity management in cloud…

