
Senior AI Infrastructure Engineer

Remote / Online - Candidates ideally in
Portland, Multnomah County, Oregon, 97204, USA
Listing for: SupportFinity™
Full Time, Remote/Work from Home position
Listed on 2026-02-16
Job specializations:
  • IT/Tech
    AI Engineer, Cloud Computing, Systems Engineer, Data Engineer
Salary/Wage Range or Industry Benchmark: $121,500 - $145,500 USD per year
Job Description & How to Apply Below

Senior AI Infrastructure Engineer

Company: WEX

Location: Boston, MA (Remote; must reside within 30 miles of Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA)

Salary: $121.50K - $145.50K/yr

Type: Full-time

Benefits: Medical, Dental, Vision, Life, Retirement, PTO

Posted: 15 hours ago

About The Team

We are the backbone of the AI organization, building the high‑performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

How You'll Make An Impact

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost‑effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Responsibilities
  • Platform Architecture:
    Design and maintain a robust, Kubernetes‑based AI platform that supports distributed training and high‑throughput inference serving.
  • Inference Optimization:
    Engineer low‑latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
  • Compute Orchestration:
    Manage and scale GPU clusters on cloud (AWS) or on‑prem environments, implementing efficient scheduling, auto‑scaling, and spot instance management to optimize costs.
  • Operational Excellence (MLOps):
    Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
  • Reliability & Observability:
    Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
  • Developer Experience:
    Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self‑serve compute resources without managing underlying infrastructure.
  • Security & Compliance:
    Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.
Experience You'll Bring
  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.
  • Production Expertise:
    Proven track record of managing large‑scale production clusters (Kubernetes) and distributed systems.
  • Hardware Fluency:
    Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
  • Serving Proficiency:
    Experience deploying and scaling open‑source LLMs and embedding models using containerized solutions.
  • Automation First:
    Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.
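In the "Automation First" spirit above, a minimal Python sketch of automating one piece of toil: flagging consistently idle GPU nodes as scale‑down candidates. The node names and the 10% utilization threshold are hypothetical:

```python
def scale_down_candidates(samples: dict[str, list[float]],
                          threshold: float = 0.10) -> list[str]:
    """Return node names whose every recent GPU-utilization sample is
    below `threshold` -- candidates to cordon and drain."""
    return sorted(node for node, utils in samples.items()
                  if utils and max(utils) < threshold)

usage = {
    "gpu-node-a": [0.02, 0.05, 0.01],   # idle
    "gpu-node-b": [0.90, 0.85, 0.95],   # busy
    "gpu-node-c": [0.00, 0.08, 0.04],   # idle
}
print(scale_down_candidates(usage))  # → ['gpu-node-a', 'gpu-node-c']
```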
Technical Skills
  • Core Engineering:
    Expert proficiency in Python and Go; comfortable digging into lower‑level system performance.
  • Orchestration & Containers:
    Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.
  • Infrastructure as Code:
    Advanced skills with Terraform, CloudFormation, or Pulumi.
  • Model Serving:
    Hands‑on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
  • Cloud Platforms:
    Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
  • Observability:
    Proficiency with Prometheus, Grafana, Datadog, and tracing tools (OpenTelemetry).
  • Networking:
    Understanding of service mesh (Istio), load balancing, and high‑performance networking (RPC, gRPC).
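As an example of the observability work listed above, the Python below estimates a latency quantile from Prometheus‑style cumulative histogram buckets, mirroring the linear interpolation that `histogram_quantile()` performs for classic histograms; the bucket bounds and counts are hypothetical:

```python
def quantile_from_buckets(buckets: list[tuple[float, int]], q: float) -> float:
    """Estimate quantile `q` from Prometheus-style cumulative buckets
    [(upper_bound, cumulative_count), ...], linearly interpolating
    within the bucket that contains the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # interpolate between the bucket's lower and upper bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical request-latency buckets: 60 requests under 50 ms,
# 90 under 100 ms, 99 under 250 ms, all 100 under 500 ms.
latency_s = [(0.05, 60), (0.10, 90), (0.25, 99), (0.50, 100)]
print(quantile_from_buckets(latency_s, 0.99))  # → 0.25
```

This is the same estimate a PromQL query like `histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))` would produce for these counts, which is why bucket boundary placement matters for model‑latency SLOs.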