Senior Platform Engineer Job, Boston area, Massachusetts, USA, IT/Tech

About us:

Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. By combining deep learning with formal logic and physics-based modeling, we create verifiable, interpretable AI systems that collaborate with and support human researchers in high-stakes scientific and engineering workflows.

Our mission, 30×30, is to deliver a 30× improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030.

We aim to revolutionize hardware design and simulation in these industries and are building a team of highly motivated professionals to bring these innovations from research into commercial products.

Position Overview

As a Senior Platform Engineer at Axiomatic, you will own the reliability, deployment, and operational excellence of our AI platform. This role focuses primarily on infrastructure, CI/CD, and operations, with additional responsibilities for automation and tooling development.

You will:

Lead deployment strategies and CI/CD pipelines across multiple environments
Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premises deployments
Own infrastructure as code using Terraform to automate provisioning and configuration
Build comprehensive observability systems: monitoring, metrics, logging, and alerting
Implement security controls, compliance frameworks, and data governance policies
Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
Ensure system reliability, performance, and scalability
Drive incident response, postmortems, and continuous improvement
Troubleshoot infrastructure and application issues across multiple environments

Your mission

Deployment & CI/CD

Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
Own the full deployment lifecycle: build, test, release, and rollback strategies
Implement blue-green deployments, canary releases, and progressive rollouts
Build automated deployment tooling and workflows
Ensure zero-downtime deployments and rollback capabilities
Optimize build and deployment performance
Manage artifact repositories and container registries

Infrastructure & Cloud Operations

Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
Architect and deploy on-premises solutions for enterprise customers (Linux-based)
Manage Kubernetes clusters, container orchestration, and networking
Implement disaster recovery, backup strategies, and business continuity
Optimize cloud costs and resource utilization
Define and track SLIs, SLOs, and error budgets for critical services

Infrastructure as Code

Write and maintain Terraform modules for infrastructure provisioning
Implement GitOps workflows for infrastructure changes
Automate infrastructure scaling, updates, and operations
Ensure reproducible and version-controlled infrastructure

Observability & Monitoring

Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
Build dashboards for system health, performance, and business metrics
Implement distributed tracing for microservices
Conduct capacity planning and performance analysis
Drive reliability improvements through data-driven insights

Security & Compliance

Implement security best practices: identity management, secrets management, network policies
Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
Conduct security audits and vulnerability remediation
Implement data governance policies for AI pipelines and user data
Ensure compliance with data privacy regulations (GDPR, CCPA)

Automation & Tooling Development

Write automation scripts and tools in Python for operational tasks
Build internal tooling for deployments, monitoring, and incident response
Develop runbooks, automation, and self-healing systems
Create APIs for infrastructure operations when needed
Maintain high code quality and testing standards for tooling

Reliability & Incident Management

Participate in on-call rotation and lead incident response
Conduct blameless postmortems and drive action items
Build and maintain incident response playbooks
Improve system resilience and the handling of failure modes

Coll…