Sr DevOps Engineer Job, Santa Clara area, California, USA, IT/Tech

Job Summary

We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production‑grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments. This engineer will play a critical role in moving the platform from an early‑stage, partially manual operating model into a repeatable, supportable, and production‑ready DevOps model. The environment includes Kubernetes‑hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo‑based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.

This is a hands‑on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.

Key Responsibilities

Platform Deployment & CI/CD

Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
Build and maintain deployment workflows that support safe and seamless promotion across environments.
Improve and maintain Argo‑based deployment workflows to enable controlled release progression from test to staging to production.
Establish baseline deployment mechanisms for the site‑builder application and related services.
Standardize Kubernetes application packaging and deployment patterns, with a strong preference toward Helm‑based lifecycle management for complex services and third‑party components.
Migrate existing deployments to Helm charts where appropriate.

Kubernetes & Runtime Platform Engineering

Support the deployment and ongoing operation of services running in Kubernetes.
Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
Investigate and harden service‑to‑service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
Partner with development teams to define production‑grade runtime requirements, resource sizing, restart policies, and platform support boundaries.

Infrastructure as Code & Cloud Services

Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB‑compatible document databases across all environments.
Eliminate manual infrastructure setup where possible and replace it with reproducible, version‑controlled deployment patterns.
Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps‑aligned practices.

Data Services, Snapshots & Developer Enablement

Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
Build tooling and operational processes for:
- production and staging database snapshots,
- restoring snapshots into development environments,
- enabling local debugging and development from realistic data states.
Support creation of local and development environments, including Minikube‑based environment‑as‑code approaches that mirror production behavior as closely as practical.
Improve platform reproducibility so engineers can quickly stand up close‑to‑production development environments.

Workflow Orchestration & Temporal Support

Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.

Observability, Reliability & Incident Readiness

Design and…