Reliability Engineering Manager
Job available in:
Santiago Mexquititlán Barrio Primero, Querétaro, Mexico
Posted on 2026-01-03
Company:
Petco
Full-time position
Job specializations:
IT/Technology
Cloud, Site Reliability Engineering, Systems Engineer, IT Project Manager
Job description
Summary:
Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning reliability goals with business and product objectives. This role requires strong leadership, deep technical understanding of distributed systems and SRE practices, and a strategic mindset to manage risk, guide incident response, and continuously improve reliability outcomes.
Duties & Responsibilities:
- Lead and manage a team of Site Reliability Engineers, providing coaching, mentorship, and performance feedback.
- Partner with senior leadership to define reliability objectives and align SRE strategies with overall business and product goals.
- Define, implement, and evolve SLOs, SLIs, and error budgets in collaboration with product and engineering teams.
- Oversee the reliability, performance, and capacity of production systems, including incident management, post-incident reviews, and problem management.
- Drive automation for operational tasks, deployments, and recovery playbooks to reduce toil and improve consistency.
- Design and maintain infrastructure and platform reliability using infrastructure as code tools such as Terraform, Ansible, or similar.
- Guide the implementation and management of containerized and cloud-native platforms (for example, Kubernetes) with a focus on resilience, scalability, and safe rollouts.
- Own observability practices and tooling (logging, metrics, tracing, alerting) to ensure proactive detection and fast diagnosis of issues.
- Champion best practices for security, compliance, and governance in production environments.
- Collaborate with cross-functional teams to ensure reliability is considered in architecture, design, and release planning.
- Foster a culture of blameless incident reviews, learning, and continuous improvement within the Reliability Engineering organization.
- Manage relationships with external vendors and service providers that support reliability, monitoring, and infrastructure needs.
Minimum Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in Site Reliability Engineering, Production Engineering, or related fields, including at least 2 years in a leadership or management role.
- Strong proficiency in scripting or programming languages and frameworks such as Python, Java, Node.js, or Next.js.
- Experience operating large-scale systems on cloud platforms such as AWS, Azure, or Google Cloud Platform.
- In-depth knowledge of containerization and orchestration technologies such as Docker and Kubernetes.
- Experience with infrastructure as code and configuration management tools (for example, Terraform, Ansible, or similar).
- Hands-on experience with observability and incident management tools (for example, Prometheus, Grafana, Datadog, PagerDuty, or equivalents).
- Solid understanding of SRE principles, including SLOs/SLIs, error budgets, capacity planning, and incident response.
- Excellent problem-solving, troubleshooting, and communication skills, with the ability to influence and collaborate across teams.