Senior DevOps Engineer, Infrastructure & Reliability
Listed on 2026-02-21
Worth AI, a leader in the computer software industry, is looking for a Senior DevOps Engineer to join our Infrastructure team with a singular mission: to make our systems faster, more reliable, and more resilient while making life dramatically easier for engineers shipping software. In this role, you won’t just manage infrastructure; you will design and evolve the foundation that every product and engineer depends on.
You will act as a force multiplier by eliminating operational friction, automating repetitive processes, strengthening system reliability, and building scalable infrastructure patterns that allow teams to deploy confidently and recover quickly. You are part architect, part reliability engineer, and part automation evangelist.
Responsibilities
- Conduct regular interviews with engineering teams to identify operational pain points in CI/CD, deployments, observability, and cloud environments, and proactively eliminate them.
- Design and implement scalable Infrastructure-as-Code patterns using tools like Terraform to standardize cloud provisioning and reduce configuration drift.
- Own and evolve our Kubernetes platform (EKS or self-managed), ensuring workloads are secure, scalable, and resilient by default.
- Architect and optimize CI/CD pipelines to improve deployment frequency, reduce lead time, and increase confidence in releases.
- Lead systemic reliability initiatives, including incident response improvements, root cause analysis practices, and postmortem frameworks.
- Design and enforce secure networking, IAM, and secrets management strategies across environments.
- Improve observability by refining metrics, logs, and tracing using tools like Datadog, ensuring actionable insight into system health.
- Optimize cloud cost efficiency through rightsizing, autoscaling strategies, and architectural improvements.
- Own disaster recovery planning, backup strategies, and multi-region resilience initiatives.
- Refactor brittle or manually managed infrastructure into automated, testable, and reproducible systems.
- Introduce new infrastructure tooling or architectural shifts and drive adoption through documentation, workshops, and hands‑on support.
- Lead by example in incident management, risk mitigation, and operational excellence.
- Communicate technical trade‑offs clearly across engineering and product stakeholders, balancing speed with safety.
Technology Stack
- Cloud & Infrastructure: AWS (EKS, RDS, MSK, S3, Lambda, IAM, VPC)
- Containerization & Orchestration: Kubernetes, ArgoCD
- Infrastructure-as-Code: Terraform
- CI/CD: GitHub Actions (or equivalent)
- Monitoring & Observability: Datadog
- Data & Messaging: PostgreSQL, Kafka, Redis
- Languages (as needed): Bash, Python, TypeScript
- 8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
- Proven experience designing and operating production Kubernetes environments at scale.
- Deep hands‑on expertise with AWS infrastructure and cloud networking.
- Strong experience building and maintaining Terraform modules across large cloud environments.
- Demonstrated ownership of CI/CD systems and measurable improvement of DORA metrics.
- Experience leading incident response processes and driving meaningful postmortem outcomes.
- Strong understanding of distributed systems, event‑driven architectures (Kafka), and database performance (PostgreSQL).
- Proven ability to modernize legacy infrastructure and eliminate manual operational toil.
- Experience navigating high‑ambiguity environments and translating operational friction into prioritized infrastructure roadmaps.
- Demonstrated ability to build trust across teams while raising the reliability bar.
Success Metrics
- DORA Metrics Improvement: Drive measurable improvements in Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery (MTTR).
- System Reliability: Maintain or exceed defined SLO/SLA targets with reduced incident frequency and duration.
- Infrastructure Stability: Reduce production incidents caused by misconfiguration, manual processes, or infrastructure drift.
- Operational Efficiency: Increase percentage of infrastructure managed through code and automation.
- Cost Optimization: Improve cloud cost efficiency without…