Site Reliability Engineer
Listed on 2025-11-02
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Join to apply for the Staff Site Reliability Engineer role at Grindr.
This range is provided by Grindr. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.
Base Pay Range: $/yr - $/yr
Location
Hybrid role based in our Chicago or Palo Alto offices. You will be required to be in the office Tuesdays and Thursdays.
What’s so interesting about this role?
The Site Reliability Engineering (SRE) team at Grindr is responsible for ensuring our systems are stable, performant, and scalable as we continue to grow globally. This role reports directly to the Director of Technical Operations and plays a critical part in keeping our infrastructure running reliably while supporting both backend and operations teams. By driving improvements in automation, incident response, and performance optimization, this position ensures Grindr can deliver a safe, reliable, and seamless experience to millions of users worldwide.
The team’s work directly impacts uptime, efficiency, and overall system resilience, supporting Grindr’s broader roadmap of building a secure and high‑performing platform for the LGBTQ+ community.
- Monitoring and Alerting: Set up and maintain monitoring systems to track the health and performance of applications and infrastructure. Create and manage alerting mechanisms to detect and respond to issues quickly.
- Incident Response: Handle incidents and outages, working to resolve them swiftly and minimize downtime. Perform root cause analysis to prevent future occurrences and improve system resilience.
- Automation: Develop tools and scripts to automate repetitive tasks, such as deployments, monitoring, and scaling, to increase efficiency and reduce human error.
- Performance Optimization: Analyze system performance and identify bottlenecks or areas for improvement. Work with development teams to optimize code and infrastructure for better performance and resource utilization.
- Capacity Planning: Plan for future growth by analyzing current usage trends and forecasting resource needs. Ensure systems can handle increased load without compromising performance or reliability.
- SLOs & SLAs: Define and measure SLOs and SLAs to set expectations for system reliability and performance. Track these metrics and work to maintain or exceed the defined standards.
- Incident Management and Postmortems: Conduct post-mortems after incidents to document what went wrong, what was done to fix it, and how to prevent similar incidents. This process drives continuous improvement and learning from failures.
- Collaboration with Development Teams: Work closely with software developers to integrate reliability and performance into the development process. Provide guidance on best practices and assist with designing resilient systems.
- Security and Compliance: Ensure systems are secure and compliant with relevant regulations and standards. Implement security measures, monitor for vulnerabilities, and respond to security incidents.
- Continuous Improvement: Continuously look for ways to improve system reliability, performance, and efficiency. Stay updated with industry trends and advancements to implement best practices and technologies.
- Participate in an on-call rotation.
- 5+ years of experience in site reliability, including incident response, incident management, automation, and performance optimization.
- 5+ years of experience in cloud platforms (AWS preferred).
- 4+ years of experience working with DevOps technologies such as Docker, Kubernetes, Helm, and Terraform.
- 4+ years of experience developing and maintaining CI/CD pipelines.
- 4+ years of experience using a scripting language such as Python or Bash.
- Experience coding in Kotlin or another JVM language is a plus.
- Technical Expertise:
  - Proficient in at least one programming language (e.g., Python, Go, Java).
  - Strong knowledge of Linux/Unix systems.
  - Experience with cloud platforms (e.g., AWS, GCP, Azure).
  - Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes).
  - Understanding of networking concepts and protocols.
- Reliability Engineering:
  - Experience with monitoring, logging, and alerting tools (e.g.,…