Senior DevOps / Platform Reliability Engineer

Job in Palm Coast, Flagler County, Florida, 32164, USA
Listing for: Zingtree
Full Time position
Listed on 2026-05-16
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Position: Senior DevOps / Platform Reliability Engineer

About Zingtree

Zingtree is the next-generation intelligent process automation platform reimagining customer experience operations for the world’s top support leaders. With 500+ customers, including Optum, Corpay, Sony, SharkNinja, and Allianz, we transform self-service, surface automation opportunities, and turn every agent into an expert.

The Role

We’re hiring a Senior DevOps / Platform Reliability Engineer to own the platform that powers our agentic CX product. You’ll build the CI/CD, infrastructure, and observability backbone that enables us to ship multi‑agent systems safely to enterprise customers.

If you want to operate a production AI platform and use AI to help operate it, this role is for you.

In this role, you will collaborate with development, operations, and infrastructure teams to automate and streamline processes, build and maintain tools for deployment, monitoring, and operations, and troubleshoot issues across development and production environments.

What You’ll Do
  • Own and evolve CI/CD pipelines using GitHub Actions and OIDC‑based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.
  • Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
  • Operate and scale our Kubernetes platform (EKS + Argo CD), including autoscaling, ingress, external‑dns, cert‑manager, External Secrets Operator, backups, runtime guardrails, and multi‑tenant isolation for enterprise customers.
  • Manage the edge and network perimeter, including Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security controls.
  • Operate the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with responsibility for backups, point‑in‑time recovery (PITR), and multi‑AZ disaster recovery aligned to defined RTO/RPO objectives.
  • Build and maintain Lambda workloads where event‑driven or serverless architectures are the right fit.
  • Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token cost, tool‑call latency, evaluation signals, and prompt/version tracking.
  • Strengthen our security and compliance posture for SOC 2 and HIPAA, including least‑privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.
  • Drive FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, per‑tenant and per‑workload cost attribution, and LLM cost controls.
  • Build and evolve our AI‑native DevOps capabilities.
  • Partner with engineering teams to define platform standards, service templates, deployment best practices, and operational SLOs.
  • Monitor system performance and ensure reliability, scalability, and security across infrastructure and services.
  • Collaborate with software engineering teams to support continuous integration and continuous delivery best practices.
  • Document infrastructure, deployment processes, and operational standards to support knowledge sharing across the team.
Agentic AI in DevOps

You’ll help define how Zingtree uses agentic AI to operate and improve our platform using modern AI operational practices.

Responsibilities include:
  • Design and operate auto‑remediation agents for common production toil such as certificate rotation, noisy pods, infrastructure drift, and flaky CI pipelines, with human‑in‑the‑loop (HITL) controls for any destructive or customer‑impacting actions.
  • Use LLMs for incident triage and root cause analysis, including log and trace summarization, signal correlation, and first‑draft post‑mortems that are always reviewed by humans.
  • Connect AI agents to internal systems through the Model Context Protocol (MCP), including GitHub, Jira, PagerDuty, AWS, Kubernetes, Terraform, and related platforms, using scoped credentials, audit logging, and allow‑listed access.
  • Apply AI‑driven observability techniques, including anomaly detection on metrics, LLM‑based log clustering, and alert deduplication and…
Position Requirements
10+ years of work experience