Product Reliability Engineering Lead
Job in Houston, Harris County, Texas, 77019, USA
Listed on 2026-06-02
Listing for: Pyramid Consulting, Inc
Full-time position
Job specializations:
- IT/Tech: Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Job Description & How to Apply Below
This is a 12+ month contract opportunity with long-term potential, located in the US (Remote-CST). Please review the job description below and contact me ASAP if you are interested.
Job -15460
Pay Range: $85 - $95/hour. Employee benefits include, but are not limited to, health insurance (medical, dental, vision), 401(k) plan, and paid sick leave (depending on work location).
Key Responsibilities:
- Define and lead the reliability strategy for the Acquisition Platform, ensuring alignment with product, platform, and enterprise goals.
- Establish SLOs, SLIs, and error budgets that tie reliability targets to business outcomes and partner expectations.
- Shift reliability requirements into early design and development phases so resiliency, failover, and graceful degradation are architected in, not bolted on.
- Design reliability patterns across platform services, APIs, workflows, and dependent systems, both internal and external.
- Architect end-to-end observability across the platform, including metrics, structured logging, distributed tracing, and alerting.
- Establish monitoring standards and dashboards that provide real-time visibility into platform health, partner-facing services, and integration dependencies.
- Embed observability into platform services from design through deployment so teams can detect, diagnose, and resolve issues rapidly.
- Drive adoption of synthetic monitoring and canary deployments to validate production behavior proactively.
- Collaborate closely with the Acquisition delivery team and stakeholders to align outcomes with the reliability strategy.
- Partner with AMS, infrastructure, and other tech teams to ensure clear ownership boundaries and smooth operational handoffs.
- SRE principles – SLOs, SLIs, error budgets, toil reduction, blameless postmortems
- Observability design – distributed tracing, APM telemetry, structured logging, real-time alerting, synthetic monitoring
- Resilience and fault tolerance – circuit breakers, bulkheads, retry/backoff, graceful degradation, failover validation
- Chaos engineering and reliability testing – fault injection, load/stress testing, failure mode analysis
- CI/CD reliability integration – automated reliability gates, canary deployments, feature flags, progressive rollouts
- AI-assisted reliability techniques – anomaly detection, predictive alerting, prompt-driven runbook automation, agent-based remediation
- Responsible AI use – including consideration of security, data exposure, and operational risk
- Cloud-native operations – containerized platforms, event-driven architectures, infrastructure as code
- Growth-oriented mindset – ability to think beyond today's constraints and identify what is required to build the future
- Excellent communication skills – ability to translate reliability concerns between engineering, product, and business teams
- Must-have skills: Site Reliability Engineering (SRE), AWS Cloud (EKS/ECS/Lambda), Observability & Monitoring (Prometheus/Grafana/Datadog/Splunk), Kubernetes & CI/CD Automation, Chaos Engineering & Reliability Testing, SLO/SLI/Error Budget Management
- 5+ years of experience in site reliability engineering, platform engineering, or production operations roles
- Experience defining and operating SLO/SLI frameworks tied to business outcomes
- Hands-on experience designing observability for distributed, API-driven platforms
- Experience with reliability and resiliency testing including chaos engineering and fault injection
- Experience guiding and mentoring engineers on reliability practices
- Enterprise-scale delivery experience with both onshore and offshore cross-functional teams
- Direct experience applying Agile methodologies in product-centric delivery models
- AWS operational experience – CloudWatch, X-Ray, Fault Injection Simulator, ECS/EKS, Lambda, EventBridge
- Experience integrating reliability practices with DevSecOps and CI/CD pipelines
- Familiarity with AI/ML-driven operations tools and incident management platforms