
SRE Architect, AI-Powered Reliability

Job in Portland, Cumberland County, Maine, 04122, USA
Listing for: WEX, Inc.
Full Time position
Listed on 2026-05-16
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below

About the Team & Role

WEX operates across multiple lines of business (Mobility, Benefits, and Travel), serving enterprise customers globally with payment and technology solutions that demand uncompromising reliability. These are mission‑critical systems handling high‑volume financial transactions where availability, transactional integrity, and low latency are non‑negotiable. Our SRE practice is in its early stages, and the decisions made now will define how we build, operate, and continuously improve reliable systems for years to come.

This person will define and enforce the reliability standards, operational practices, and architectural guardrails that every line of business at WEX must meet, and will use AI as a primary tool to establish, scale, and continuously improve those standards faster than traditional approaches alone can achieve.

This is not a role embedded in a single business unit. It sits at the center of WEX engineering with a mandate that spans all LOBs. You will set the bar, and you will hold it, working with engineering leadership, platform teams, and LOB architects to make reliability a consistent, measurable, and continuously improving property of every system we operate.

How you’ll make an impact

Enterprise Standards & Governance
  • Define, publish, and enforce enterprise‑wide SRE best practices and operational standards covering observability, incident management, resilience, capacity planning, and reliability architecture, applicable across all WEX lines of business.
  • Define and lead WEX’s AI‑Powered Reliability Engineering strategy, driving adoption of SRE agents across the software lifecycle—from design and development through deployment and operations—to improve reliability, automation, and operational efficiency.
  • Architect and oversee the implementation of mission‑critical systems, ensuring that reliability, availability, and transactional integrity requirements are designed in from the start, not bolted on after the fact.
  • Establish and govern SLO, SLI, and error budget frameworks across LOBs, partnering with engineering leadership to align reliability targets with business and commercial expectations.
  • Own the production readiness review process, defining the criteria every service must meet before going live and driving accountability for remediation when gaps are found.
  • Serve as the primary technical advisor to engineering leadership across WEX on matters of reliability, resilience architecture, and operational excellence.
Observability
  • Define the enterprise observability standard (what good looks like for metrics, distributed tracing, structured logging, and alerting) and hold all LOBs accountable to it.
  • Use AI‑powered tooling to move beyond static dashboards: deploy intelligent anomaly detection, dynamic baselining, and automated signal correlation to reduce noise and surface actionable signals at scale.
  • Drive instrumentation practices that give engineering teams genuine insight into the health of high‑availability, low‑latency systems, including real‑time payment flows and transaction pipelines where latency and consistency are critical.
  • Lead the evaluation and adoption of AI‑assisted observability platforms that reason across telemetry sources to accelerate detection and diagnosis.
Incident Management
  • Establish the enterprise incident management framework: severity definitions, response playbooks, escalation paths, on‑call standards, and cross‑LOB communication protocols.
  • Integrate AI into the full incident lifecycle: intelligent triage and automated runbook suggestions at detection, real‑time signal correlation during active incidents, and AI‑assisted timeline and impact summaries at resolution.
  • Reduce cognitive burden on on‑call engineers through tooling that surfaces relevant context, prior incidents, and likely remediation paths automatically during high‑pressure situations.
  • Define, track, and report on incident metrics (MTTD, MTTR, recurrence rate) across all LOBs, using trends to drive systemic improvement rather than one‑off fixes.
Resilience Engineering & Self‑Healing Systems
  • Lead cross‑functional initiatives to enhance system resilience and performance across WEX,…