DevOps/Infrastructure Engineering Job, Jamestown area, Town of Poland, New York, USA, IT/Tech

Position: DevOps / Infrastructure Engineering
Location: Town of Poland

Plus 10 is a technical recruitment agency with a focus on Engineering and Product professionals that build web applications using a modern stack. Plus 10 recruiters are knowledge stewards that open doors for individuals looking to progress their career.

We are working hand-in-hand with the following client to help find a DevOps / Infrastructure Engineer.

The client is a non‑profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. The goal is for the AI Physicist to achieve a breakthrough that unifies quantum field theory and general relativity and to explain the deepest unresolved phenomena in our universe by 2035. They're pioneering a new approach to scientific discovery by creating an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights.

The organization operates like a tech start‑up by moving quickly and continuously iterating to accelerate scientific progress. By combining AI, symbolic reasoning, and autonomous research capabilities, we develop a platform that goes beyond analyzing existing knowledge to actively contribute to physics research.

Job Description

We're seeking a Member of Technical Staff, DevOps / Infrastructure Engineering, to architect, automate, and scale the infrastructure that underpins our large‑scale model training and research workflows. This role spans both cloud environments (AWS) and HPC infrastructure (Buzz & Lambda HPC GPU clusters with high‑speed interconnects), requiring you to design and codify the systems, pipelines, and automation that enable our researchers and engineers to move fast with confidence.

Our ideal candidate brings strong fundamentals in Unix/Linux, deep experience in CI/CD and infrastructure‑as‑code, and a systems mindset to define standards, build automation, and grow our infrastructure practice from the ground up. You'll be instrumental in building the reliable, scalable foundation that powers our autonomous AI Physicist while partnering closely with training engineers and researchers to accelerate breakthrough scientific discoveries.

Key Responsibilities

Infrastructure Architecture & Automation

Design and run large‑scale pre‑training experiments for both dense and MoE architectures, from experiment planning through multi‑week production runs.
Architect hybrid infrastructure solutions that span cloud and on‑premises HPC environments seamlessly.
Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations.

CI/CD & Developer Experience

Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in.
Develop tooling for developer workflows including reproducible builds, ephemer …
Create self‑service infrastructure patterns that empower researchers and engineers.
Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.

HPC & GPU Cluster Management

Manage and extend HPC environments …
Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
Optimize cluster scheduling and resource allocation for high‑performance GPU workloads.
Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise.

Monitoring, Observability & Reliability

Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long‑horizon training runs.
Build observability stacks that provide visibility into both system health and job‑level performance.
Proactively detect and resolve infrastructure issues before they impact research workflows.

Security & Compliance

Implement and manage secrets management and identity security solutions (Vault, KMS, IAM).
Champion security best practices, IAM…

