Director of Production Engineering; Reliability Platform Engineering
Listed on 2025-12-01
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Toshiba Global Commerce Solutions is seeking a Director of Production Engineering to lead the reliability backbone of our global POS, cloud, and middleware platform. This strategic role owns system availability, resilience, performance, observability, and release reliability across a distributed, mission‑critical commerce ecosystem.
This leader will unify Site Reliability Engineering (SRE), Resilience & Performance Engineering, Observability, and AI‑driven Reliability Automation into one cohesive function. As AI accelerates development velocity, verification and reliability become the core bottlenecks—making this role a cornerstone of our engineering organization.
You will partner closely with Architecture, Cloud Operations, Functional Quality Engineering, and Software Development to ensure predictable reliability, smooth releases, and dramatically fewer Sev‑1/Sev‑2 incidents.
Responsibilities
System Reliability & Uptime
- Define and enforce SLO/SLA frameworks, error budgets, and release criteria.
- Lead availability, resilience, and performance strategy across all services.
- Own MTTR, MTBF, incident prevention, and rollback strategies at scale.
- Lead teams across SRE & L3 Engineering, Resilience & Performance Engineering, Observability & Telemetry, and AI Reliability Automation.
- Build a culture focused on prevention over firefighting.
- Collaborate with Principal Engineers and Architects to define system guardrails, resilience patterns, and failure modes.
- Ensure high‑quality Production Readiness Reviews (PRRs) and architectural consistency.
- Own chaos, failover, load, stress, and soak testing strategies.
- Validate store‑mode behavior, payment workflows, edge‑device dependencies, and multi‑service interactions.
- Ensure complete, accurate signals across logs, traces, metrics, and business health.
- Partner with AI systems to build intelligent anomaly detection pipelines.
- Integrate AI‑based reliability scoring, resiliency prediction, automated gating, regression analysis, and incident pattern detection.
- Define the path toward autonomous release reliability pipelines.
- Partner with Software Development, Functional Quality Engineering, Cloud Operations, Architecture, and TPM/TPO teams.
- Drive multi‑team initiatives and ensure readiness across complex release trains.
- Bachelor’s degree in Computer Science or Engineering, or 10‑15 years of direct experience.
- 10–15+ years in SRE, Reliability Engineering, Production Engineering, Distributed Systems, and Performance/Resilience Engineering.
- Proven ownership of uptime and system reliability in complex distributed architectures.
- Expertise in distributed systems, cloud platforms (AKS, Kubernetes), observability stacks (OpenTelemetry, Grafana, App Insights, Datadog), performance tuning, fault tolerance, network fundamentals, DB/service scaling, and chaos testing.
- Architectural Leadership: Experience designing resilience patterns (timeouts, retries, hedging, circuit breakers). Strong partnership with architects and senior engineers.
- Operational Maturity: Led SRE/on‑call organizations. Defined SLOs, SLIs, and error budgets. Track record of driving an incident‑prevention culture.
- Leadership & Communication: Builds strong engineering teams and hires top talent. Influential communicator with executives and cross‑functional teams. Highly collaborative and low‑ego.
- AI‑driven anomaly detection, regression analysis, incident clustering, reliability scoring.
- Experience with retail POS, payments, edge devices, or store environments. Hybrid cloud + edge architectures.
- Leading reliability transformations and scaling engineering organizations (200→500+).
- Uptime becomes engineered, not reactive.
- Development and QA operate at AI‑enabled speed.
- Our platform grows safely while delivering stability and performance.
- We match or surpass best‑in‑class tech organizations…