Sr Systems Engineer SRE
Listed on 2026-02-09
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability
Berkley Technology Services (BTS) is the dynamic technology solution for W. R. Berkley Corporation, a Fortune 500 Commercial Lines Insurance Company. With key locations in Urbandale, IA and Wilmington, DE, BTS provides innovative and customer-focused IT solutions to the majority of WRBC’s 60+ operating units across the globe. BTS’s wide reach ensures that ideas and opinions are considered at every level of the organization to guarantee we find the best solutions possible.
Driven by a commitment to collaboration, BTS acts as consultants to our customers and Operating Units by providing comprehensive solutions that not only address the challenge at hand, but proactively plan for the “What’s Next” in our industry and beyond.
With a culture centered on innovation and entrepreneurial spirit, BTS stands as a community of technology leaders with eyes toward the future -- leaders who genuinely care about growing not only their team members, but themselves, and take pride in their employees who shine. BTS offers endless ways to get involved and the chance to grow your career into a wide range of roles you never knew existed.
Come join us as we push forward into the future of industry’s leading technological solutions.
Berkley Technology Services:
Right Team, Right Technology, Simple and Secure.
As a Sr Systems Engineer, SRE, you will play a crucial role in ensuring the reliability, scalability, and performance of software systems. Collaborating closely with teams, you will set and enforce best practices that keep our cloud and on-premises environments scalable, reliable, and secure.
This role requires a strong understanding of the entire technology stack (network, storage, OS, virtualization, database, development, applications) to observe, monitor, troubleshoot, and automate activity in the Berkley environment.
- Define and track reliability and observability OKRs, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Implement robust monitoring and alerting systems to proactively monitor health, identify potential issues, analyze system performance, and facilitate quick response to incidents.
- Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
- Drive the development and implementation of automation solutions to remove “toil”, streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
- Identify and address performance bottlenecks in applications and infrastructure to improve efficiency and user experience.
- Work closely with incident management to quickly address and resolve system outages or performance issues to minimize downtime and impact on users.
- Collaborate actively with development and operations teams to implement observability and resiliency requirements in order to ensure smooth deployment and operation of software systems.
- Lead the coordination with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand; anticipate growth and scalability requirements.
- Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
- Modernize the disaster recovery program for both on-premises and cloud-based Berkley solutions.
- 5+ years of IT experience working with infrastructure support and development
- 5+ years of experience in Site Reliability Engineering and DevOps.
- Proficiency in scripting languages such as Python, Go, Bash, and/or JavaScript, and experience with shell scripting.
- Strong expertise in observability, monitoring, alerting, and logging tools (Dynatrace, Datadog, ELK Stack).
- Hands-on experience designing and implementing logging and monitoring architectures.
- Expertise in designing and implementing on-premises, cloud, and hybrid resiliency solutions (HA, AA, AP), disaster recovery, and business continuity planning.
- Deep understanding of cloud computing principles, including IaaS, PaaS, and SaaS models.
- Exper…