
DevOps/Infrastructure Engineering

Job in Town of Poland, Jamestown, Chautauqua County, New York, 14701, USA
Listing for: Plus10 Recruitment
Full Time position
Listed on 2025-12-28
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100,000 - 150,000 USD yearly
Job Description & How to Apply Below
Position: DevOps / Infrastructure Engineering
Location: Town of Poland

Plus10 is a technical recruitment agency with a focus on Engineering and Product professionals that build web applications using a modern stack. Plus10 recruiters are knowledge stewards that open doors for individuals looking to progress their career.

We are working hand-in-hand with the following client to help find a DevOps / Infrastructure Engineer.

The client is a non‑profit organization building an autonomous AI Physicist designed to advance humanity's understanding of the fundamental laws of nature. The goal is for the AI Physicist to achieve a breakthrough that unifies quantum field theory & general relativity and to explain the deepest unresolved phenomena in our universe by 2035. They're pioneering a new approach to scientific discovery by creating an intelligent system that can explore theoretical frameworks, reason across disciplines, and generate novel insights.

The organization operates like a tech start‑up by moving quickly and continuously iterating to accelerate scientific progress. By combining AI, symbolic reasoning, and autonomous research capabilities, we develop a platform that goes beyond analyzing existing knowledge to actively contribute to physics research.

Job Description

We're seeking a Member of Technical Staff, DevOps / Infrastructure Engineering to architect, automate, and scale the infrastructure that underpins our large‑scale model training and research workflows. This role spans both cloud environments (AWS) and HPC infrastructure (Buzz & Lambda HPC GPU clusters with high‑speed interconnects), requiring you to design and codify the systems, pipelines, and automation that enable our researchers and engineers to move fast with confidence.

Our ideal candidate brings strong fundamentals in Unix/Linux, deep experience in CI/CD and infrastructure‑as‑code, and a systems mindset to define standards, build automation, and grow our infrastructure practice from the ground up. You'll be instrumental in building the reliable, scalable foundation that powers our autonomous AI Physicist while partnering closely with training engineers and researchers to accelerate breakthrough scientific discoveries.

Key Responsibilities

Infrastructure Architecture & Automation
  • Design and run large‑scale pre‑training experiments for both dense and MoE architectures, from experiment planning through multi‑week production runs.
  • Architect hybrid infrastructure solutions that span cloud and on‑premises HPC environments seamlessly.
  • Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
  • Build systems that reduce operational toil and establish guardrails that let researchers focus on experiments, not operations.
CI/CD & Developer Experience
  • Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities, observability, and safety built in.
  • Develop tooling for developer workflows including reproducible builds, ephemer …
  • Create self‑service infrastructure patterns that empower researchers and engineers.
  • Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
HPC & GPU Cluster Management
  • Manage and extend HPC environments …
  • Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
  • Optimize cluster scheduling and resource allocation for high‑performance GPU workloads.
  • Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups as they arise.
Monitoring, Observability & Reliability
  • Implement comprehensive monitoring, logging, and alerting across all infrastructure layers using Prometheus, Grafana, ELK/EFK, and OpenTelemetry.
  • Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long‑horizon training runs.
  • Build observability stacks that provide visibility into both system health and job‑level performance.
  • Proactively detect and resolve infrastructure issues before they impact research workflows.
Security & Compliance
  • Implement and manage secrets management and identity security solutions (Vault, KMS, IAM).
  • Champion security best practices, IAM…