AI Engineer - Infrastructure

Job in New York, New York County, New York, 10261, USA
Listing for: Traversal
Full Time position
Listed on 2026-01-17
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 150,000 - 300,000 USD yearly
Job Description & How to Apply Below
Location: New York

Traversal is the AI Site Reliability Engineer (SRE) for the enterprise—already trusted by some of the largest companies in the world to troubleshoot, remediate, and even prevent the most complex production incidents. Our mission is to free engineers from endless firefighting and enable them to focus on creative, high‑impact work.

Our roots remain deeply embedded in AI research, and we’re channeling that scientific rigor and creativity into building the premier AI agent lab for the enterprise. What we’re proudest of is assembling the most talented yet nicest group of individuals, from researchers at MIT, Harvard, and Berkeley to world‑class engineers from industry (Citadel Securities, Cockroach Labs, Datadog, DE Shaw, ServiceNow, Glean, Perplexity, Pinecone, and more), to take on one of the hardest problems for AI to solve. Without the entire team, none of this would be possible.

The Role

As an AI Infrastructure Engineer on the Platform / Reliability team, you’ll design, secure, and operate the core systems that power Traversal’s AI products. We already serve Fortune 50 enterprises with multi‑tenancy and SOC 2 Type II controls, and we’re rapidly scaling.

You’ll focus on high‑concurrency inference, Kafka data pipelines, and agentic tooling (via MCP) — building infrastructure that’s reliable under extreme load. This includes safe concurrency, graceful retries, queue management, autoscaling, observability, and Kubernetes‑native scheduling.

This is a senior, high‑impact role: you’ll own foundational systems, work across Python, Rust, Kubernetes, and Kafka, and shape how enterprise AI reliability is built and scaled.

Responsibilities
  • System Design & Architecture:
    Design scalable, reliable infrastructure for AI inference, data pipelines, and agentic workflows.
  • Queue & Job Scheduling (K8s-native):
    Migrate from Python multiprocessing + Postgres‑as‑queue to Kubernetes‑native queuing and orchestration (KEDA/HPA, Jobs/CronJobs, Kueue/Argo).
  • Managed Kafka Operations:
    Tune partitioning and throughput, design dead‑letter queue (DLQ) and replay runbooks, and implement idempotent sinks to avoid duplicates.
  • Autoscaling:
    Scale on real signals (queue lag, in‑flight requests, latency); add burst capacity and safe drains.
  • Per‑Tool Reliability:
    Productionize MCP tool chains with circuit breaking, timeouts, sandboxing, and audit.
  • Progressive Delivery:
    Implement canary and blue/green rollouts for stateful services, pre‑warm caches/weights, and enable graceful termination.
  • Infrastructure as Code:
    Evolve Terraform/Helm/Kustomize for multi‑environment deployments, secrets, policy‑as‑code (OPA/Rego), and workload identity.
Requirements
  • 3+ years of experience at technically rigorous companies or teams.
  • Proven experience operating high‑concurrency backends with managed Kafka fan‑in/out and at‑least‑once processing.
  • Production experience building and maintaining systems in Python and Rust (Rust 2024).
  • Familiarity with AWS, EKS, Terraform, Helm/Kustomize.
  • Strong debugging skills across runtime, Kafka, network, and auth layers.
  • Security‑minded, with experience implementing least privilege, default‑deny egress, auditability, and policy‑as‑code.
Nice to Have
  • GPU workload operations (MIG, topology‑aware placement), inference servers, token streaming gateways.
  • Data governance (PII discovery/redaction), lineage, tokenization.
  • Cross‑region active/active for Kafka and stateless services.
  • Service mesh (Envoy/Istio), Cilium/eBPF, ClickHouse for analytics.
Compensation

We offer competitive compensation, startup equity, health insurance, and additional benefits. The U.S. base salary range for this full‑time, in‑person role in New York is $150,000–$300,000, plus equity and benefits. Our salary ranges are based on location, level, and role. Individual compensation is determined by experience, skills, and job‑related knowledge.

Why You Should Join Us

We’ll make sure you’re fully supported with health insurance, a great tech setup, flexible time off, and plenty of in‑office snacks. We offer competitive salary and equity packages, and we give thoughtful consideration to every hire on our small, high‑impact team.

Traversal is fully in‑office, 5 days a week, based in New York near Madison Square Park. We have a collaborative, hard‑working culture and are energized by building the future of AI‑powered software maintenance.

Working here means owning meaningful parts of the product, having the flexibility to move fast, and learning constantly. This is a place to grow your career, make a real impact, and help define a new category of infrastructure software.

As set forth in Traversal’s Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.
