Site Reliability Engineer Job Frisco area, Texas USA, IT/Tech

Job Title

Site Reliability Engineer

About Skyhigh Security

Skyhigh Security is a dynamic, fast‑paced, cloud company that is a leader in the security industry. Our mission is to protect the world’s data, and because of this, we live and breathe security. We value learning at our core, underpinned by openness and transparency. Since 2011, organizations have trusted us to provide them with a complete, market‑leading security platform built on a modern cloud stack.

Our industry‑leading suite of products radically simplifies data security through easy‑to‑use, cloud‑based, Zero Trust solutions that are managed in a single dashboard, powered by hundreds of employees across the world. With offices in Santa Clara, Aylesbury, Paderborn, Bengaluru, Sydney, Tokyo and more, our employees are the heart and soul of our company. Skyhigh Security is more than a company; here, when you invest your career with us, we commit to investing in you.

We embrace a hybrid work model, creating the flexibility and freedom you need from your work environment to reach your potential. From our employee recognition program to our “Blast Talks” learning series and team celebrations (we love to have fun!), we strive to be an interactive and engaging place where you can be your authentic self.

Job Summary

The Site Reliability Engineer at Skyhigh Security will be responsible for monitoring, maintaining, and troubleshooting operational issues of a high‑availability production environment. The SRE will also act as a bridge between the Operations, Engineering, and Product Management teams and will represent the customer's point of view to keep driving enhancements to our products and uptime. SREs are responsible for managing and improving the operational aspects of systems, such as monitoring, alerting, incident response, and vendor interactions.

Only US citizens are eligible.

About the role

Perform Incident Management and Change Management to maintain the continuous availability of all Cloud Infrastructure services.
Ensure all SRE and operating procedures are maintained and executed.
Maintain a 24x7 production environment with a high level of service availability, perform quality reviews, and manage operational issues.
Perform root cause analysis for major incidents and drive the process by involving required stakeholders.
Perform problem management by analyzing metrics, alarms, and dashboards to troubleshoot problem areas, and report issues to assist in performance tuning and fault finding.
Implement proactive monitoring, alerting, trend analysis, and self‑healing solutions.
Explore and innovate with new technologies, features, and tools to improve the platform and automate operational tasks using Bash, Python, or any other programming language.
Manage and maintain runbooks and standard operating procedures.
Manage, coordinate, and document all types of maintenance activities and outages.
Perform patching and upgrades for vulnerability management.
Work closely with other teams to develop new ideas into internal tools.
Understand the existing architecture and work with various Engineering teams to develop and execute strategies to provide a high‑quality production service.
Work a flexible schedule in a 24x7 environment with rotational shifts.

About you

Bachelor’s degree in computer science, electrical engineering, or a related area, with 7+ years of SRE experience in a large enterprise organization.
System administration experience in Linux environments.
Experience with end‑to‑end monitoring setup for infrastructure and applications.
Experience with Prometheus, Grafana, ELK, OpenSearch, CloudWatch, PagerDuty, and other monitoring tools.
Solid experience with cloud technologies such as AWS and OCI.
Good experience with tools for containerized workloads, such as Kubernetes.
Network knowledge (TCP/IP, UDP, DNS, load balancing) and prior network administration experience are required.
Experience with BGP,…

