
SRE-Platform Lead

Job in Las Vegas, Clark County, Nevada, 89105, USA
Listing for: Gemini Solutions Pvt Ltd
Full Time position
Listed on 2026-06-13
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer, IT Support
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below

Position:
Senior Site Reliability Engineer (SRE) Platform Lead

Job Type: Full Time

Immediate Interview

Role Overview
  • We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
  • This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
  • You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.
Experience
  • 5+ years of experience in SRE, DevOps, or Production Engineering
  • Experience working in production-critical environments with high availability requirements
  • Exposure to global systems and cross-team collaboration
Key Responsibilities
Platform Reliability & Ownership
  • Own availability, performance, and scalability of production systems
  • Define and implement SLIs, SLOs, and error budgets
  • Drive continuous improvements in system resilience and efficiency
Incident Management & Root Cause Analysis
  • Lead end-to-end incident response and service restoration
  • Perform deep root cause analysis across infrastructure, application, data, and network layers
  • Implement long-term fixes and reduce recurrence through engineering improvements
Observability & Monitoring
  • Design and enhance monitoring, logging, and alerting systems
  • Develop actionable dashboards and improve alert quality
  • Enable proactive detection of system issues
Automation & DevOps Practices
  • Automate operational workflows to reduce manual effort
  • Build and maintain CI/CD pipelines
  • Implement Infrastructure as Code (IaC) for scalable infrastructure management
  • Manage and optimize systems on modern cloud platforms
  • Troubleshoot distributed systems across compute, storage, and network layers
  • Diagnose latency, routing, and performance issues in globally distributed environments
Data & Workflow Reliability
  • Troubleshoot data pipelines, job failures, and data inconsistencies
  • Perform data validation and analysis
  • Ensure reliability across data dependencies and workflows
Networking & Traffic Management
  • Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
  • Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance
Stakeholder Collaboration
  • Act as a liaison between engineering teams and business stakeholders
  • Communicate system status, incidents, and risks with clarity and context
  • Partner with cross-functional teams to drive reliability improvements
AI-Driven Reliability (Emerging Focus)
  • Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
  • Leverage intelligent automation to improve incident response and operational efficiency
Core Expectations
  • Demonstrates strong ownership of production systems and outcomes
  • Independently drives incident resolution and follow-through
  • Applies structured, analytical thinking to complex technical problems
  • Communicates effectively in high-impact, production-critical scenarios
  • Focuses on long-term reliability and scalability improvements
Technical Skills
Programming & Automation
  • Strong experience in Python for automation and tooling
  • Proficiency in shell scripting (Bash)
  • Experience with API-driven and event-driven automation
  • Hands‑on experience with AWS, Azure, or GCP
  • Strong understanding of cloud architecture, networking, and security fundamentals
  • Infrastructure as Code using Terraform, CloudFormation, or Ansible
DevOps & CI/CD
  • Experience with Jenkins, GitLab CI, or similar tools
  • Strong understanding of build, release, and deployment pipelines
Observability
  • Experience with Datadog, Splunk, Prometheus, or Grafana
  • Strong logging, monitoring, and alerting practices
  • Familiarity with incident management tools (e.g., PagerDuty)
Data & Databases
  • Strong SQL skills for troubleshooting and validation
  • Understanding of data pipelines and system dependencies
Systems & Platform
  • Experience with Docker and containerized environments
  • Exposure to Kubernetes and web servers (e.g., Nginx)
Orchestration
  • Experience with Airflow, Autosys, or similar scheduling tools
Networking & CDN
  • Strong understanding of DNS, HTTP/S, proxies, and load balancing
  • Experience with CDN and edge delivery platforms (e.g., Akamai or similar)
