Director of Production Engineering; Reliability Platform Engineering
Listed on 2025-12-16
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
Director of Production Engineering (Reliability Platform Engineering)
Toshiba Global Commerce Solutions is seeking a Director of Production Engineering (Reliability Platform Engineering) to lead the reliability backbone of our global POS, cloud, and middleware platform. This strategic role owns system availability, resilience, performance, observability, and release reliability across a distributed, mission‑critical commerce ecosystem.
This leader will unify Site Reliability Engineering (SRE), Resilience & Performance Engineering, Observability, and AI‑driven Reliability Automation into one cohesive function. As AI accelerates development velocity, verification and reliability become the core bottlenecks—making this role a cornerstone of our engineering organization.
You will partner closely with Architecture, Cloud Operations, Functional Quality Engineering, and Software Development to ensure predictable reliability, smooth releases, and dramatically fewer Sev‑1/Sev‑2 incidents.
Responsibilities
System Reliability & Uptime:
- Define and enforce SLO/SLA frameworks, error budgets, and release criteria.
- Lead availability, resilience, and performance strategy across all services.
- Own MTTR, MTBF, incident prevention, and rollback strategies at scale.
Unified Reliability Engineering Organization:
- Lead teams across SRE & L3 Engineering, Resilience & Performance Engineering, Observability & Telemetry, and AI Reliability Automation.
- Build a culture focused on prevention over firefighting.
- Collaborate with Principal Engineers and Architects to define system guardrails, resilience patterns, and failure modes.
- Ensure high‑quality Production Readiness Reviews (PRRs) and architectural consistency.
Resilience & Performance Engineering:
- Own chaos, failover, load, stress, and soak testing strategies.
- Validate store‑mode behavior, payment workflows, edge‑device dependencies, and multi‑service interactions.
Observability & Telemetry:
- Ensure complete, accurate signal for logs, traces, metrics, and business health.
- Partner with AI systems to build intelligent anomaly detection pipelines.
AI‑Driven Release Reliability:
- Integrate AI‑based reliability scoring, resiliency prediction, automated gating, regression analysis, and incident pattern detection.
- Define the path toward autonomous release‑reliability pipelines.
- Partner with Software Development, Functional Quality Engineering, Cloud Operations, Architecture, and TPM/TPO teams.
- Drive multi‑team initiatives and ensure readiness across complex release trains.
- Bachelor’s degree in Computer Science or Engineering, or 10‑15 years of direct experience.
- 10–15+ years in SRE, Reliability Engineering, Production Engineering, Distributed Systems, and Performance/Resilience Engineering.
- Proven ownership of uptime and system reliability in complex distributed architectures.
- Expertise in distributed systems, cloud platforms (AKS, Kubernetes), observability stacks (OpenTelemetry, Grafana, App Insights, Datadog), performance tuning, fault tolerance, network fundamentals, DB/service scaling, and chaos testing.
- Architectural Leadership: Experience designing resilience patterns (timeouts, retries, hedging, circuit breakers). Strong partnership with architects and senior engineers.
- Operational Maturity: Led SRE/on‑call organizations. Defined SLOs, SLIs, and error budgets. Track record of driving an incident prevention culture.
- Leadership & Communication: Builds strong engineering teams and hires top talent. Influential communicator with executives and cross‑functional teams. Highly collaborative and low‑ego.
- AI‑driven anomaly detection, regression analysis, incident clustering, reliability scoring.
- Experience with retail POS, payments, edge devices, or store environments.
- Hybrid cloud + edge architectures.
- Leading reliability transformations and scaling engineering organizations (200 → 500+).
As AI accelerates development velocity, the bottleneck shifts from coding to verification, reliability, and release safety. This role ensures:
- Uptime becomes engineered, not reactive.
- Development and QA operate at AI‑enabled speed.
- Our platform grows safely while delivering stability and…