Platform Engineer; Remote
New Orleans, Orleans Parish, Louisiana, 70121, USA
Listed on 2026-06-04
IT/Tech
Systems Engineer, Cloud Computing
Radimal is a veterinary radiology and AI diagnostics platform delivering 24/7 imaging insights to hospitals nationwide. We combine board-certified radiologists with advanced AI to support real-time clinical decision-making for patients when it matters most.
Our platform spans high-throughput medical imaging, GPU-backed inference, global distribution, and enterprise-grade reliability. As we scale, we’re investing in senior platform ownership to make the system safer, more predictable, and easier for engineers to build on.
Why This Role Exists
Radimal has grown quickly. While the platform is working, operational ownership has been too diffuse. Reliability, on-call clarity, and platform standards need a single senior owner who can reduce noise, establish guardrails, and make the system more predictable as we scale.
This role exists to bring focus, ownership, and calm to the platform layer.
The Role
We’re hiring a Staff Platform Engineer to own the technical foundations that enable Radimal’s engineering teams to move quickly and reliably.
This is a senior, hands-on role with real accountability for platform architecture, infrastructure, and production systems. You’ll own DevOps, reliability, and on-call systems, with authority to investigate and diagnose issues across the full stack.
You will not be expected to do everything at once. Success comes from establishing ownership, setting priorities, and making the system more predictable over time.
A core part of this role is operational containment and reliability ownership. You’ll reduce operational burden on product and AI teams by owning platform standards, tooling, and reliability so others can focus on building.
You’ll work closely with the CEO and VP of Engineering on platform strategy, architectural tradeoffs, and operational risk, while maintaining clear ownership of production systems.
This role is for someone who wants true ownership and influence through execution, not advisory distance.
What You Will Own
Own the core platform foundations that support all product and AI development
Build shared infrastructure, libraries, and patterns that make it easier to ship safely
Establish clear interfaces and ownership boundaries so teams can move independently
Improve developer experience through better CI/CD, local tooling, and observability
Raise the overall operational maturity of the engineering organization
Infrastructure and Cloud
Own and evolve Radimal’s AWS and Terraform footprint
Lead deployments across ECS, Fargate, EC2, containerized services, and GPU workloads
Manage and improve workloads running on Render and Modal
Make architectural decisions for scale, reliability, and cost efficiency
Reduce operational burden on product and AI engineers by owning reliability and tooling
Create guardrails that increase safety without slowing development
Enable engineers to self-serve infrastructure and diagnostics where appropriate
Reliability, On-Call, and Operations
Own production uptime, SLOs, and operational health
Design and own on-call coverage and escalation models
Serve as senior escalation during incidents while building systems that minimize the need for escalation
Lead incident response and post-incident reviews with clear accountability
Eliminate ambiguity around who owns production at all times
Over time, success in this role means fewer incidents, fewer escalations, and a platform that largely runs without heroics.
Observability and Performance
Operate and extend Grafana and Prometheus monitoring stacks
Improve alerting, diagnostics, and operational visibility
Build high-availability and fault-tolerant architectures
Implement caching, CDN, and performance strategies for global scale
Investigate production issues across infrastructure, backend services, data pipelines, AI inference workflows, and frontend behavior
Trace request flow end to end across GraphQL APIs, Python services, and React applications
Read and debug React code as needed to understand client-side behavior and API usage
Form and test hypotheses during incidents to drive fast, accurate resolution
Know when to dive deep personally and when to pull in specialists
MLOps and AI Platform Support
Understand MLOps fundamentals…