Product Reliability Engineering Lead Job, Houston area, Texas, USA, IT/Tech

Immediate need for a talented Product Reliability Engineering Lead. This is a 12+ month contract opportunity with long-term potential and is located in the US (Remote, CST). Please review the job description below and contact me ASAP if you are interested.

Job -15460

Pay Range: $85 - $95/hour. Employee benefits include, but are not limited to, health insurance (medical, dental, vision), 401(k) plan, and paid sick leave (depending on work location).

Key Responsibilities:

Define and lead the reliability strategy for the Acquisition Platform, ensuring alignment with product, platform, and enterprise goals.
Establish SLOs, SLIs, and error budgets that tie reliability targets to business outcomes and partner expectations.
Shift reliability requirements into early design and development phases so resiliency, failover, and graceful degradation are architected in, not bolted on.
Design reliability patterns across platform services, APIs, workflows, and dependent systems, both internal and external.
Architect end-to-end observability across the platform, including metrics, structured logging, distributed tracing, and alerting.
Establish monitoring standards and dashboards that provide real-time visibility into platform health, partner-facing services, and integration dependencies.
Embed observability into platform services from design through deployment so teams can detect, diagnose, and resolve issues rapidly.
Drive adoption of synthetic monitoring and canary deployments to validate production behavior proactively.
Collaborate closely with the Acquisition delivery team and stakeholders to align outcomes with the reliability strategy.
Partner with AMS, infrastructure, and other tech teams to ensure clear ownership boundaries and smooth operational handoffs.

Skills and Competencies:

SRE principles – SLOs, SLIs, error budgets, toil reduction, blameless postmortems
Observability design – distributed tracing, APM telemetry, structured logging, real-time alerting, synthetic monitoring
Resilience and fault tolerance – circuit breakers, bulkheads, retry/backoff, graceful degradation, failover validation
Chaos engineering and reliability testing – fault injection, load/stress testing, failure mode analysis
CI/CD reliability integration – automated reliability gates, canary deployments, feature flags, progressive rollouts
AI-assisted reliability techniques – anomaly detection, predictive alerting, prompt-driven runbook automation, agent-based remediation
Responsible AI use – including consideration of security, data exposure, and operational risk
Cloud-native operations – containerized platforms, event-driven architectures, infrastructure as code
Growth-oriented mindset – ability to think beyond today's constraints and identify what is required to build the future
Excellent communication skills – ability to translate reliability concerns between engineering, product, and business teams

Key Requirements and Technology Experience:

Must-have skills: Site Reliability Engineering (SRE), AWS Cloud (EKS/ECS/Lambda), Observability & Monitoring (Prometheus/Grafana/Datadog/Splunk), Kubernetes & CI/CD Automation, Chaos Engineering & Reliability Testing, SLO/SLI/Error Budget Management
5+ years of experience in site reliability engineering, platform engineering, or production operations roles
Experience defining and operating SLO/SLI frameworks tied to business outcomes
Hands-on experience designing observability for distributed, API-driven platforms
Experience with reliability and resiliency testing including chaos engineering and fault injection
Experience guiding and mentoring engineers on reliability practices
Enterprise-scale delivery experience with both onshore and offshore cross-functional teams
Direct experience applying Agile methodologies in product-centric delivery models
AWS operational experience – CloudWatch, X-Ray, Fault Injection Simulator, ECS/EKS, Lambda, EventBridge
Experience integrating reliability practices with DevSecOps and CI/CD pipelines
Familiarity with AI/ML-driven operations tools and incident management platforms

Our client is a leader in the insurance industry, and we are currently interviewing to fill this and other similar contract positions. If you…