Site Reliability Engineer Job, Palo Alto area, California, USA, IT/Tech

Position: Staff Site Reliability Engineer

This range is provided by Grindr. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

$/yr - $/yr

This is a hybrid role based in our Chicago or Palo Alto offices and will require you to be in office Tuesdays and Thursdays.

What’s so interesting about this role?

The Site Reliability Engineering (SRE) team at Grindr is responsible for ensuring our systems are stable, performant, and scalable as we continue to grow globally. This role reports directly to the Director of Technical Operations and plays a critical part in keeping our infrastructure running reliably while supporting both backend and operations teams. By driving improvements in automation, incident response, and performance optimization, this position ensures Grindr can deliver a safe, reliable, and seamless experience to millions of users worldwide.

The team’s work directly impacts uptime, efficiency, and overall system resilience, supporting Grindr’s broader roadmap of building a secure and high‑performing platform for the LGBTQ+ community.

What’s the job?

• Monitoring and Alerting:
Set up and maintain monitoring systems to track the health and performance of applications and infrastructure. Create and manage alerting mechanisms to detect and respond to issues quickly.

• Incident Response:
Handle incidents and outages, working to resolve them swiftly and minimize downtime. Perform root cause analysis to prevent future occurrences and improve system resilience.

• Automation:
Develop tools and scripts to automate repetitive tasks, such as deployments, monitoring, and scaling, to increase efficiency and reduce human error.

• Performance Optimization:
Analyze system performance and identify bottlenecks or areas for improvement. Work with development teams to optimize code and infrastructure for better performance and resource utilization.

• Capacity Planning:
Plan for future growth by analyzing current usage trends and forecasting resource needs. Ensure that systems can handle increased load without compromising performance or reliability.

• Service Level Objectives (SLOs) and Service Level Agreements (SLAs):
Define and measure SLOs and SLAs to set expectations for system reliability and performance. Track these metrics and work to maintain or exceed the defined standards.

• Incident Management and Postmortems:
After incidents, conduct postmortems to document what went wrong, what was done to fix it, and how to prevent similar incidents in the future. This process supports continuous improvement and learning from failures.

• Collaboration with Development Teams:
Work closely with software developers to integrate reliability and performance into the development process. Provide guidance on best practices and assist with designing resilient systems.

• Security and Compliance:
Ensure that systems are secure and compliant with relevant regulations and standards. Implement security measures, monitor for vulnerabilities, and respond to security incidents.

• Continuous Improvement:
Continuously look for ways to improve system reliability, performance, and efficiency. Stay current with industry trends and advancements in order to adopt best practices and technologies.

• Participate in an on‑call rotation.

What We’ll Love About You

• 5+ years of experience in site reliability, including incident response, incident management, automation, and performance optimization

• 5+ years of experience in cloud platforms (AWS preferred)

• 4+ years of experience working with DevOps technologies such as Docker, Kubernetes, Helm, and Terraform

• 4+ years of experience developing and maintaining CI/CD pipelines

• 4+ years of experience using a scripting language such as Python or Bash

• Experience coding in Kotlin or another JVM language is a plus

We’ll Really Swoon if You Have

• Technical Expertise:
- Proficient in at least one programming language (e.g., Python, Go, Java).
- Strong knowledge of Linux/Unix systems.
- Experience with cloud platforms (e.g., AWS, GCP, Azure).
-…

