Principal SRE (AI Enablement Platform) Job: Jamestown area, Town of Poland, New York, USA; IT/Tech

Position: Principal SRE (AI Enablement Platform)-2
Location: Town of Poland

Join ABC Fitness, the leading technology provider for the fitness industry!

What You’ll Do

Architect and evolve core platform capabilities for reliability, including execution environments, CI/CD systems, and validation pipelines that support high-throughput, machine-assisted change.
Design and implement fast, isolated execution environments where generated work can be built, tested, and safely discarded at scale.
Transform CI/CD into a validation system by embedding automated verification (tests, integration harnesses, canarying, rollback signals) into promotion decisions.
Build production-like validation environments that allow realistic system behavior testing without impacting live systems.
Establish deep observability patterns for autonomous workflows, including tracing what ran, what failed, why, and what it cost across agents, tools, and orchestration layers.
Define and implement guardrails-as-code, including access controls, policy enforcement, cost protections, and auditability for platform usage.
Design for reliability from day one, including scalability, fault tolerance, performance optimization, and operational resilience.
Lead technical design reviews and influence platform and infrastructure decisions across engineering teams.
Define and document reusable infrastructure patterns, platform standards, and reference implementations that create a consistent paved path for teams.

What This Is Not

Not a ticket queue or generic support role.
Not incremental-only ops without ownership of architecture and adoption.
Not "just Kubernetes admin": Kubernetes is one layer in a broader platform problem.

What You’ll Need

Typically 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Platform Engineering.
Deep experience designing and operating distributed systems at scale, including cloud platforms (e.g., AWS), Kubernetes, and infrastructure-as-code.
Strong expertise in reliability engineering practices, including incident management, fault isolation, resiliency design, and system performance tuning.
Experience building and operating CI/CD systems, test harnesses, and automated validation frameworks.
Strong understanding of observability systems, including metrics, logging, tracing, and system-level debugging.
Demonstrated ability to define technical standards and influence multiple teams through architecture, design review, and strong engineering judgment.
Strong production mindset, with experience designing systems for scalability, availability, and operational efficiency.
Experience implementing secure, multi-tenant infrastructure with strong isolation, IAM, and secrets management practices.
Excellent cross-functional collaboration skills.
Growth mindset and One Team orientation.

And It’s Great to Have

Experience supporting AI/LLM-powered systems in production, including understanding of latency, cost, and orchestration challenges.
Experience designing high-throughput, isolated compute systems or sandboxed execution environments.
Experience building internal developer platforms or platform-as-a-product capabilities.
Familiarity with governance or regulated environments.
Experience with advanced validation systems such as canarying, chaos engineering, or automated rollback strategies.

What Success Looks Like

Faster delivery through platform-enabled validation and automation.
Automated validation of changes before production, reducing reliance on manual review.
Platform standards adopted across teams as the default paved path.
Early detection of reliability issues through strong observability and validation systems.
Reduced infrastructure complexity so engineers can focus on product and policy.

Why This Matters

ABC Fitness is evolving toward an AI-native engineering model where automation, agents, and platform systems handle increasing portions of the software lifecycle. This role builds the foundation that enables scalable, trustworthy, and high-velocity software delivery across the organization.
