Senior Site Reliability Engineer
Listed on 2025-12-08
IT/Tech
Systems Engineer, Cloud Computing
Overview
Lakeview IT is passionate about delivering high‑quality products and services to our customers. Our technology operations team is committed to ensuring reliable, scalable, and high‑performing services for our clients. We are looking for a talented and motivated Site Reliability Engineer to join our dynamic team and help us continue to build and maintain a world‑class infrastructure.
The Sr. Site Reliability Engineer at Lakeview is responsible for ensuring the availability, performance, and scalability of the company’s critical systems. They will lead the design and implementation of infrastructure solutions, focusing on automation, monitoring, and high reliability. This role involves optimizing system performance, managing incident responses, and conducting post‑mortems to drive continuous improvement. The engineer will also work closely with engineering, development, and operations teams to create and enforce best practices, establish service‑level objectives, and ensure seamless deployment processes.
Mentoring junior team members and driving key architectural decisions are also essential to the role, helping build a culture of reliability and operational excellence.
The salary range for the role is $130,000 to $150,000, plus an annual bonus. The position can be 100% remote; candidates located in the Agoura Hills, CA area are expected to work a hybrid schedule.
Responsibilities
- Proactively identify and resolve incidents before they impact operations.
- Monitor all systems and infrastructure for the highest level of availability.
- Perform routine maintenance tasks, including monitoring, patching, and backups.
- Respond to incidents and outages in a timely and effective manner.
- Collaborate with other teams to diagnose and resolve complex issues.
- Document incident details and implement corrective actions to prevent recurrence.
- Document processes, configurations, and troubleshooting procedures.
- Diagnose and resolve application performance problems or system outages.
- Play the role of Incident Manager during outages.
- Resolve complex hardware and software issues, and work with vendors when necessary.
- Optimize system performance and resource utilization on‑prem and in the cloud.
- Develop and maintain automation scripts to streamline repetitive tasks.
- Use scripting languages (e.g., PowerShell, Python) to automate system administration.
- Implement configuration management tools to ensure consistency and repeatability.
- Create and maintain comprehensive documentation of IT processes and procedures.
- Lead the design, development, and implementation of reliable, scalable infrastructure systems.
- Mentor junior SREs, providing guidance on best practices and technical issues.
- Architect and execute disaster recovery and high‑availability plans.
- Drive incident management processes, ensuring swift and effective resolution of critical issues.
- Optimize system performance through proactive monitoring, tuning, and capacity planning.
- Lead root‑cause analysis and post‑mortem discussions to identify long‑term fixes.
- Develop and maintain complex automation scripts to enhance system reliability.
- Influence reliability improvements within the engineering organization, promoting a culture of observability and resilience.
- Champion the adoption of new tools and technologies that enhance system stability and deployment efficiency.
- Communicate effectively with stakeholders and executive leadership regarding system status, incidents, and upcoming reliability initiatives.
Qualifications
- Strong understanding of IT infrastructure components, including servers, networks, and storage.
- Knowledge of scripting languages (e.g., PowerShell, Python).
- Knowledge of networking concepts and protocols (e.g., TCP/IP, DNS, DHCP).
- Experience with IT service management frameworks.
- Experience with cloud platforms such as AWS and Azure.
- Experience with virtualization technologies such as Azure VDI and AWS WorkSpaces.
- Experience with monitoring and alerting tools (e.g., New Relic, Datadog).
- Excellent problem‑solving and analytical skills.
- Strong communication and interpersonal skills.
- Extensive expertise in the Windows operating system.
The…