Sr. Site Reliability Engineer Job, Houston area, Texas, USA, IT/Tech

Join us to build enterprises of tomorrow

Sr. Site Reliability Engineer Job Description

We are seeking a skilled Site Reliability Engineer (SRE) to join our team. In this role, you will be responsible for bridging the gap between development and operations by applying software engineering principles to infrastructure and operations tasks. Your primary focus will be ensuring the reliability, availability, performance, and scalability of our production systems while minimizing manual operational work through automation and enhancing system resilience.

Position Overview

The Site Reliability Engineer will work closely with development and operations teams to design, implement, and maintain highly reliable systems. You will be instrumental in establishing best practices for observability, incident response, and infrastructure management. Your expertise will help reduce operational overhead, improve system performance, and ensure seamless deployments through CI/CD pipelines.

Qualifications
Required Skills and Experience

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
3+ years of experience in SRE, DevOps, or similar roles
Strong proficiency with Kubernetes (K8s) and Docker containerization
Experience with the ELK stack (Elasticsearch, Logstash, Kibana) for logging and monitoring
Good to have:
Understanding of Java programming and Java application troubleshooting
Working knowledge of SQL and MongoDB databases
Familiarity with Angular for frontend monitoring and diagnostic tooling
Strong understanding of system architecture, cloud infrastructure, and networking
Experience with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible)
Experience with monitoring and observability platforms
Excellent problem-solving skills and ability to troubleshoot complex systems
Strong verbal and written communication skills

Preferred Skills

Must have:
Experience with AWS public cloud
Good to have:
Knowledge of Azure and GCP
Familiarity with CI/CD tools (Jenkins, GitLab CI, GitHub Actions)
Understanding of service mesh technologies (e.g., Istio)
Experience with scripting languages (Python, Bash)
Understanding of distributed systems and microservices
Experience implementing SLOs, SLIs, and SLAs
Awareness of security best practices
Certifications in relevant technologies (e.g., CKA, AWS Certified)

Roles and Responsibilities
System Reliability and Performance

Design, implement, and maintain highly available and scalable infrastructure
Define and track SLOs, SLIs, and error budgets
Conduct capacity planning and optimize performance
Improve system resilience and fault tolerance
Perform regular health checks and proactive maintenance

Monitoring and Observability

Deploy and maintain monitoring solutions (e.g., ELK stack)
Build dashboards for system metrics, logs, and app performance
Set up alerting systems to reduce alert fatigue
Implement distributed tracing and ensure service telemetry
Maintain comprehensive logging across systems

Incident Management and Response

Lead incident response, including mitigation and resolution
Conduct root cause analysis and post-incident reviews
Maintain incident runbooks and knowledge base
Participate in on‑call rotation for critical systems

Automation and Toil Reduction

Identify and automate repetitive operational tasks
Implement Infrastructure as Code for consistent provisioning
Automate testing and deployment processes
Design and maintain reliable CI/CD pipelines
Implement automated testing within workflows
Support canary deployments, feature flagging, and rollback strategies

Infrastructure Management

Manage Kubernetes clusters and containerized applications
Oversee config management and version control
Implement infrastructure security and compliance
Optimize resources and ensure backup/disaster recovery

Collaboration and Knowledge Sharing

Partner with development teams to enhance reliability
Provide architectural guidance with an SRE lens
Produce documentation and lead knowledge‑sharing sessions
Promote SRE best practices across the organization
Collaborative, improvement‑driven team culture
Exposure to cutting‑edge technologies
Balance of project and operational responsibilities
Focus on automation, innovation, and resilience
Strong emphasis on learning and growth

Success Metrics

Improved system availability and reliability
Reduction in MTTD and MTTR
Fewer production incidents and outages
Increased automation and reduced manual effort
Successful SLO implementation and monitoring coverage
Positive feedback from dev teams on SRE support

Transform people experience in your enterprise of tomorrow
