Sr. DevOps Platform Engineer Job, Wilmington area, Delaware, USA, IT/Tech

Role Overview

As a Senior DevOps Platform Engineer, you will play a critical role in ensuring the reliability, scalability, security, and performance of Berkley's software systems. You will collaborate closely with product engineering, infrastructure, and architecture teams to build, mature, and operate an enterprise DevOps platform that enables teams to deliver software safely and efficiently. This role blends DevOps platform engineering and SRE practices, with a focus on CI/CD, observability, automation, and reliability across both cloud and on‑premises environments.

Top Skills

CI/CD Expertise:
Advanced GitHub Actions with integrated security and quality checks (SonarQube, Veracode) and modern deployment strategies (blue/green).
Hybrid Infrastructure:
Experience with on‑prem (Windows/Linux) and Azure environments, supporting the transition to cloud.
Infrastructure as Code:
Proficiency in Terraform and automation, plus a working knowledge of Python.
Kubernetes (Deployment‑Focused):
Building deployment pipelines for on‑prem Kubernetes and AKS (cluster administration not required).
Cross‑Functional Communication:
Acts as a bridge between engineering and business stakeholders.
Leadership Potential:
Hands‑on player‑coach able to drive DevOps transformation and grow into a lead role.
FinOps Mindset:
Focus on cost control and data visualization for cloud costs (e.g., GitHub Copilot).

Responsibilities

Maintain a strong understanding of the entire technology stack to design, observe, troubleshoot, and automate systems across the Berkley environment.
Design, build, and mature enterprise CI/CD pipelines and shared DevOps platform services, enabling secure, reliable, and scalable software delivery for multiple teams.
Define, implement, and track reliability and observability OKRs, including SLIs and SLOs, to guide reliability engineering, deployment practices, and operational decision‑making.
Implement and evolve monitoring, alerting, and observability solutions, including AIOps capabilities, to proactively assess system health, detect anomalies, enable self‑healing, and support rapid incident response.
Drive automation initiatives to eliminate operational toil, streamline platform and pipeline workflows, reduce manual intervention, and improve efficiency for product engineering and SRE teams.
Identify and address performance, scalability, and reliability bottlenecks across applications, infrastructure, and delivery pipelines to improve system efficiency and user experience.
Partner with incident management and operations teams to respond to, resolve, and prevent system outages or degradation, minimizing downtime and customer impact.
Collaborate actively with development, operations, and platform teams to embed resiliency, observability, security, and reliability requirements into system design, CI/CD pipelines, and runtime environments.
Lead cross‑functional coordination with product, development, infrastructure, and architecture teams to perform capacity planning, anticipate growth, and ensure systems scale reliably with business demand.
Continuously improve platform resilience by identifying and closing gaps in architecture, tooling, processes, and operational practices.
Modernize and strengthen disaster recovery capabilities for both on‑premises and cloud‑based Berkley solutions, ensuring recoverability, resilience, and compliance with enterprise standards.

Skills

GitHub, CI/CD, Cloud, Infrastructure Engineering, On‑prem, DevOps.

Additional Skills & Qualifications

5+ years of DevOps and Site Reliability Engineering experience with hands‑on ownership of infrastructure, CI/CD platforms, and software delivery in enterprise environments.
Strong software engineering and automation skills, including proficiency in Python, Go, Bash, or JavaScript, and experience building production‑grade automation.
Proven expertise in enterprise CI/CD, GitOps, and containerized platforms, including Kubernetes, Helm, and cloud‑native delivery patterns.
Deep experience with reliability and observability, including monitoring, alerting, logging, and tracing platforms (Dynatrace, Datadog, ELK), and defining SLIs, SLOs, and reliability metrics.
Strong understanding of cloud,…