SRE Platform Lead Job, Las Vegas area, Nevada, USA, IT/Tech

Position:
Senior Site Reliability Engineer (SRE) Platform Lead

Job Type: Full Time

Immediate Interview

Role Overview

We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.

Experience

5+ years of experience in SRE, DevOps, or Production Engineering
Experience working in production-critical environments with high availability requirements
Exposure to global systems and cross-team collaboration

Key Responsibilities

Platform Reliability & Ownership

Incident Management & Root Cause Analysis

Lead end-to-end incident response and service restoration
Perform deep root cause analysis across infrastructure, application, data, and network layers
Implement long-term fixes and reduce recurrence through engineering improvements

Observability & Monitoring

Automation & DevOps Practices

Automate operational workflows to reduce manual effort
Build and maintain CI/CD pipelines
Implement Infrastructure as Code (IaC) for scalable infrastructure management
Manage and optimize systems on modern cloud platforms
Troubleshoot distributed systems across compute, storage, and network layers
Diagnose latency, routing, and performance issues in globally distributed environments

Data & Workflow Reliability

Networking & Traffic Management

Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance

Stakeholder Collaboration

AI-Driven Reliability (Emerging Focus)

Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
Leverage intelligent automation to improve incident response and operational efficiency

Core Expectations

Technical Skills

Programming & Automation

Strong experience in Python for automation and tooling
Proficiency in shell scripting (Bash)
Experience with API-driven and event-driven automation
Hands-on experience with AWS, Azure, or GCP
Strong understanding of cloud architecture, networking, and security fundamentals
Infrastructure as Code using Terraform, CloudFormation, or Ansible

DevOps & CI/CD

Observability

Data & Databases

Systems & Platform

Orchestration

Networking & CDN

Experience with CDN and edge delivery platforms (e.g., Akamai or similar)
