Senior DevOps/Platform Reliability Engineer
Listed on 2026-05-16
IT/Tech
Systems Engineer, Cloud Computing
About Zingtree
Zingtree is the next-generation intelligent process automation platform reimagining customer experience operations for the world’s top support leaders. With 500+ customers, including Optum, Corpay, Sony, SharkNinja, and Allianz, we transform self-service, surface automation opportunities, and turn every agent into an expert.
The Role
We’re hiring a Senior DevOps/Platform Reliability Engineer to own the platform that powers our agentic CX product. You’ll build the CI/CD, infrastructure, and observability backbone that enables us to ship multi‑agent systems safely to enterprise customers.
If you want to operate a production AI platform and use AI to help operate it, this role is for you.
In this role, you will collaborate with development, operations, and infrastructure teams to automate and streamline processes, build and maintain tools for deployment, monitoring, and operations, and troubleshoot issues across development and production environments.
What You’ll Do
- Own and evolve CI/CD pipelines using GitHub Actions and OIDC‑based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.
- Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
- Operate and scale our Kubernetes platform (EKS + Argo CD), including autoscaling, ingress, external‑dns, cert‑manager, External Secrets Operator, backups, runtime guardrails, and multi‑tenant isolation for enterprise customers.
- Manage the edge and network perimeter, including Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security controls.
- Operate the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with responsibility for backups, point‑in‑time recovery (PITR), and multi‑AZ disaster recovery aligned to defined RTO/RPO objectives.
- Build and maintain Lambda workloads where event‑driven or serverless architectures are the right fit.
- Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token cost, tool‑call latency, evaluation signals, and prompt/version tracking.
- Strengthen our security and compliance posture for SOC 2 and HIPAA, including least‑privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.
- Drive FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, per‑tenant and per‑workload cost attribution, and LLM cost controls.
- Build and evolve our AI‑native DevOps capabilities.
- Partner with engineering teams to define platform standards, service templates, deployment best practices, and operational SLOs.
- Monitor system performance and ensure reliability, scalability, and security across infrastructure and services.
- Collaborate with software engineering teams to support continuous integration and continuous delivery best practices.
- Document infrastructure, deployment processes, and operational standards to support knowledge sharing across the team.
You’ll help define how Zingtree uses agentic AI to operate and improve our platform using modern AI operational practices.
Responsibilities include:
- Design and operate auto‑remediation agents for common production toil such as certificate rotation, noisy pods, infrastructure drift, and flaky CI pipelines, with human‑in‑the‑loop (HITL) controls for any destructive or customer‑impacting actions.
- Use LLMs for incident triage and root cause analysis, including log and trace summarization, signal correlation, and first‑draft post‑mortems that are always reviewed by humans.
- Connect AI agents to internal systems through the Model Context Protocol (MCP), including GitHub, Jira, PagerDuty, AWS, Kubernetes, Terraform, and related platforms, using scoped credentials, audit logging, and allow‑listed access.
- Apply AI‑driven observability techniques, including anomaly detection on metrics, LLM‑based log clustering, and alert deduplication and…