Senior Site Reliability Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: OutSystems
Full Time position
Listed on 2026-06-15
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

Location

Hybrid Onsite in Menlo Park, CA

About the Role

Site Reliability Engineering (SRE) blends software engineering with infrastructure and operations to create scalable, highly reliable systems. The main goals of an SRE are to ensure production reliability, performance, and scalability while enabling rapid development of new features and services. SREs at OutSystems act as an extension of development teams, adopting reliability tenets to meet Service Level Objectives (SLOs) and deliver a frictionless customer experience.

Key Responsibilities
  • Lead and onboard services and teams to the reliability tenets.
  • Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • Design, implement, and secure scalable infrastructure following cloud‑native best practices.
  • Collaborate with software development teams to ensure systems are resilient, observable, fault‑tolerant, recoverable, and performant.
  • Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents.
  • Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct root cause analysis and post‑mortems.
  • Automate operational tasks, focusing on fast incident detection and recovery.
  • Write automation in Python, using generative AI tooling to accelerate development of mission‑critical tools.
  • Foster a culture of continuous improvement and knowledge sharing.
  • Communicate effectively with stakeholders, providing updates on system reliability and performance.
  • Participate in on‑call rotation to provide 24/7 support for production systems.
Performance Indicators
  • SLA and SLO compliance.
  • SLO coverage and detection ratio.
  • Mean time to acknowledge (MTTA).
  • Mean time to resolve (MTTR).
Qualifications
  • BS or MS in Computer Science, or equivalent.
  • 6+ years of experience in Site Reliability Engineering, managing infrastructure and services at scale.
  • History of end‑to‑end project delivery.
  • Experience managing Hadoop and Kubernetes infrastructure or equivalent.
  • Advanced knowledge of Linux, Networking, and Containers.
  • Proficiency in at least one high‑level programming language (Python, Go, etc.).
  • Strong troubleshooting and debugging skills.
  • Fluency in English and excellent communication skills.
  • Hands‑on experience with prompt engineering in software development.
  • Familiarity with AI‑native IDEs or AI assistants such as Cursor, GitHub Copilot, and Claude.
Soft Skills
  • Communication: able to communicate effectively in English orally and in writing, showing empathy.
  • Collaboration: proactive presentation skills to represent the SRE team with leadership.
  • Humbleness: accepts mistakes, apologizes, and mitigates impact promptly.
  • Accountability: takes ownership of problems and ensures closure, involving others when needed.
  • Negotiation: handles complex conversations, defusing disagreements toward mutual agreement.
  • Process Oriented: organized, follows defined processes, and challenges inefficiency to suggest improvements.
  • Problem‑solving: applies a top‑down approach, breaking problems into smaller pieces and analyzing objectively.
Technical Skills
  • Experience establishing, monitoring, and improving SLOs, SLIs, and SLAs aligned with business needs.
  • Containerization technologies and orchestration (Kubernetes, EKS); preferred certifications: CKA, CKAD, CKS.
  • Infrastructure as Code: AWS CloudFormation, Terraform, Puppet, Chef, Spacelift, etc.
  • Automation scripting: Python, Go, Bash/Shell, or other languages.
  • AWS services: EC2, RDS, ELB, CloudFront, Lambda, etc.
  • Monitoring and troubleshooting of complex distributed systems: Grafana, ELK stack, Prometheus, etc.
  • Designing resilient and fault‑tolerant systems.
  • Debugging complex distributed systems.
Benefits
  • A company at the vanguard of the agentic revolution, offering high‑growth, startup agility within an enterprise foundation.
  • Real growth opportunities through structured programs, professional development funds, and internal mobility.
  • A global collective of world‑class talent and mentors invested in your growth.

We are an equal opportunity employer. All qualified applicants receive equal consideration regardless of race, origin, religion, sex, sexual orientation, gender identity, disability, veteran status, or any other protected status.

Position Requirements
10+ Years work experience