Acquire Job Denver area,Colorado USA,IT/Tech

Job Description

Site Reliability Engineer

About Acquire Learning

Acquire Learning is a learning management platform built specifically for ABA (Applied Behavior Analysis) therapy. Clinicians and behavior technicians use Acquire every day while working with clients on the autism spectrum, and the data captured in the platform shapes real treatment decisions. We are a small, product‑focused team building in a HIPAA‑regulated environment. Reliability is not a checkbox here.

When a clinician is mid‑session with a child, the platform working and behaving predictably is the difference between productive therapy and a disrupted session. Your work will directly affect the communities we serve.

About the Role

Acquire Learning is hiring its first dedicated Site Reliability Engineer. This is a mid‑level role with a clear path to Lead SRE at Acquire as the company grows. We are looking for someone with real‑world SRE, DevOps, infrastructure, and production‑engineering experience who is ready to take meaningful ownership now and grow into the person responsible for how reliability works at Acquire. You will report to the Lead Engineer and collaborate regularly with our CEO and CTO.

You will not be inheriting a mature SRE team or a thick runbook library. You will help build them. From day one, you will be the person most focused on keeping the production environment healthy, the release pipeline trustworthy, and the reliability surface of the codebase improving. You will also raise release‑quality risks clearly and act as the customer‑facing escalation point when production issues need investigation.

This role is hands‑on across both infrastructure and code. In the early days, you will:
Run and improve our deploy pipelines, observability stack, and infrastructure as code.
Triage and respond to production alerts and customer‑reported issues.
Collaborate with engineering on the reliability and architecture areas of the codebase in Node.js and TypeScript: repair scripts, migrations, observability instrumentation, index management, job lifecycle, deploy tooling, e2e and release automation.
Own release‑quality signals and partner with engineering on the QA tooling and test automation that protect clinicians from regressions.

This is intentionally a broad role today, because Acquire needs someone who can move across infrastructure, code, and release reliability with judgment. As the company and team grow, the role narrows toward Lead SRE: setting reliability strategy, owning incident response, and shaping how operations, observability, and release engineering work at Acquire.

Where We Need Help

HIPAA‑aware operations: strengthening PHI‑safe logging, audit trails, production‑data handling, access controls, incident evidence, vendor/tooling review, and the runbooks that support a regulated healthcare environment.
Multi‑tenant architecture: helping us operate safely as Acquire grows beyond a single internal deployment, including tenant isolation, tenant‑scoped diagnostics, organization‑safe migrations, repair scripts, alerts, and support workflows.
Disaster recovery and restore confidence: improving backup verification, Atlas snapshot / point‑in‑time restore confidence, rollback procedures, break‑glass paths, and recovery drills.
Data integrity operations: making migrations, index changes, repair scripts, and production diagnostics safer to run, easier to audit, and harder to misuse under pressure.
Security and production access hygiene: tightening IAM, secrets management, least‑privilege access, deploy permissions, dependency monitoring, and the boundaries around who can touch production systems.
Incident response maturity: defining practical severity levels, SLOs, alerting standards, post‑incident follow‑up, and the difference between “known noisy” and “wake somebody up.”
Scale and cost visibility: keeping an eye on AWS and MongoDB costs, slow queries, index health, capacity planning, and performance regressions before they turn into customer‑facing problems.

What You’ll Do

Production reliability and incident response

Own day‑to‑day production health across our AWS environment, MongoDB Atlas clusters, and supporting…