DevOps Engineer III Job, Needham area, Massachusetts, USA, IT/Tech

At Advisor360°, we build technology that transforms how wealth management firms operate, scale, and serve their clients. As a leading SaaS platform in the fintech space, we’re trusted by some of the largest independent broker-dealers and RIAs to power the full advisor and client experience, from portfolio construction and reporting to compliance and client engagement.

What sets us apart? It's not just the tech (though it's best-in-class). It's the people, the purpose, and the passion behind everything we do.

We’re a team of builders, thinkers, and doers who believe that great companies are defined by the stories they tell and the experiences they create—internally and externally. We bring deep industry expertise, a collaborative spirit, and a commitment to innovation as we reshape what’s possible in wealth management.

As we grow, we’re looking for teammates who are ready to roll up their sleeves, think big, and help elevate our brand in a way that reflects the bold ambitions we have for our company and the clients we serve. Join us, and be part of a company that's not only moving fast—but making it count.

This role is hybrid, requiring three days per week onsite in our Needham, MA headquarters.

At Advisor360°, our Agentic AI team is building the platform layer that makes AI systems truly production-ready, and we’re already live in production. This isn’t a greenfield initiative; it’s a high‑impact environment where real systems are running at scale today.

As a DevOps Engineer, you’ll own the infrastructure that powers these systems. Working hands‑on with Kubernetes, GitOps, and ArgoCD, you’ll design and operate the deployment framework that enables multiple teams to ship independently and efficiently. You’ll play a critical role in establishing operational standards, ensuring reliability, and building the foundation that allows AI‑driven workflows to execute with confidence at scale.

Here’s What You’ll Do:

Cluster operations on AKS: node pool sizing, autoscaling policies, namespace isolation, network policies, and day‑two operational hygiene across environments.
GitOps delivery pipeline using ArgoCD: app-of-apps structure, environment promotion, rollback strategy, and guardrails that prevent one team’s bad deploy from cascading.
Deployment strategies: blue‑green, canary, and rolling release patterns for agentic services where a bad rollout has downstream effects on active workflows.
Security posture: RBAC, Azure AD Workload Identity, network policies, secrets management via Key Vault, and policy‑as‑code enforcement with OPA/Gatekeeper.
Platform reliability: SLIs, SLOs, alerting, and runbooks for the infra layer. When something breaks at 2 am, you write the playbook.
Developer experience: reduce the toil that slows down six feature teams. The right self‑service primitives mean engineers spend time building skills, not waiting on infra tickets.
Cost and capacity management: LLM workloads have spiky, non‑linear cost profiles. You’ll instrument and enforce budgets, quotas, and rightsizing across the cluster.

What You Bring to the Table:

5+ years operating Kubernetes in production.
Hands‑on GitOps experience with ArgoCD: multi‑environment setups, ApplicationSets, sync waves, health checks, and rollback under pressure.
Azure fluency: AKS, ACR, Azure Monitor, Key Vault, managed identity, workload identity, networking.
Infrastructure‑as‑code as a default: Terraform for everything, no console cowboys.
Scripting in Python, Go, or Bash for automation and tooling: maintained code, not one‑offs.
Strong incident response instincts. You've been on‑call, written postmortems, and fixed the underlying conditions rather than just the symptom.
Experience running LLM inference infrastructure or API gateway patterns for AI workloads.
Familiarity with agentic AI frameworks (LangGraph, AutoGen, or similar) and the infrastructure patterns they require.
OPA/Gatekeeper or other policy‑as‑code tooling for cluster governance at scale.
OpenTelemetry and distributed tracing across non‑trivial service meshes.
Service mesh experience (Istio or Linkerd) for service‑to‑service auth and traffic management.
CKA or CKS certification.
Prior work on…