SRE-Platform Lead
Job in
Las Vegas, Clark County, Nevada, 89105, USA
Listed on 2026-06-13
Listing for:
Gemini Solutions Pvt Ltd
Full Time position
Job specializations:
- IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer, IT Support
Job Description & How to Apply Below
Position:
Senior Site Reliability Engineer (SRE) Platform Lead
Job Type: Full Time
Immediate Interview
Role Overview
- We are looking for a Senior Site Reliability Engineer (SRE) with a strong platform ownership mindset to drive reliability, scalability, and performance of mission-critical, distributed systems.
- This role sits at the intersection of software engineering, cloud infrastructure, and production operations, with a focus on building resilient systems, improving observability, automating operations, and driving reliability at scale.
- You will act as a technical lead for platform reliability, working closely with engineering and business stakeholders to ensure systems are highly available, performant, and continuously improving.
- 5+ years of experience in SRE, DevOps, or Production Engineering
- Experience working in production-critical environments with high availability requirements
- Exposure to global systems and cross-team collaboration
- Own availability, performance, and scalability of production systems
- Define and implement SLIs, SLOs, and error budgets
- Drive continuous improvements in system resilience and efficiency
- Lead end-to-end incident response and service restoration
- Perform deep root cause analysis across infrastructure, application, data, and network layers
- Implement long-term fixes and reduce recurrence through engineering improvements
- Design and enhance monitoring, logging, and alerting systems
- Develop actionable dashboards and improve alert quality
- Enable proactive detection of system issues
- Automate operational workflows to reduce manual effort
- Build and maintain CI/CD pipelines
- Implement Infrastructure as Code (IaC) for scalable infrastructure management
- Manage and optimize systems on modern cloud platforms
- Troubleshoot distributed systems across compute, storage, and network layers
- Diagnose latency, routing, and performance issues in globally distributed environments
- Troubleshoot data pipelines, job failures, and data inconsistencies
- Perform data validation and analysis
- Ensure reliability across data dependencies and workflows
- Diagnose issues related to DNS, HTTP/S, proxies, and load balancing
- Work with CDN and edge delivery platforms (e.g., Akamai or similar) to optimize traffic routing and performance
- Act as a liaison between engineering teams and business stakeholders
- Communicate system status, incidents, and risks with clarity and context
- Partner with cross-functional teams to drive reliability improvements
- Apply AI/ML-driven techniques for anomaly detection, alert optimization, and predictive issue identification
- Leverage intelligent automation to improve incident response and operational efficiency
- Demonstrates strong ownership of production systems and outcomes
- Independently drives incident resolution and follow-through
- Applies structured, analytical thinking to complex technical problems
- Communicates effectively in high-impact, production-critical scenarios
- Focuses on long-term reliability and scalability improvements
- Strong experience in Python for automation and tooling
- Proficiency in shell scripting (Bash)
- Experience with API-driven and event-driven automation
- Hands‑on experience with AWS, Azure, or GCP
- Strong understanding of cloud architecture, networking, and security fundamentals
- Infrastructure as Code using Terraform, CloudFormation, or Ansible
- Experience with Jenkins, GitLab CI, or similar tools
- Strong understanding of build, release, and deployment pipelines
- Experience with Datadog, Splunk, Prometheus, or Grafana
- Strong logging, monitoring, and alerting practices
- Familiarity with incident management tools (e.g., PagerDuty)
- Strong SQL skills for troubleshooting and validation
- Understanding of data pipelines and system dependencies
- Experience with Docker and containerized environments
- Exposure to Kubernetes and web servers (e.g., Nginx)
- Experience with Airflow, Autosys, or similar scheduling tools
- Strong understanding of DNS, HTTP/S, proxies, and load balancing
- Experience with CDN and edge delivery platforms (e.g., Akamai or similar)