
Cloud Infrastructure – Site Reliability Engineer (SRE) – Sunnyvale

Job in Sunnyvale, Santa Clara County, California, 94085, USA
Listing for: Alibaba Cloud
Full Time position
Listed on 2026-03-04
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability, Data Engineer
Job Description & How to Apply Below
Position: Cloud Infrastructure – Site Reliability Engineer (SRE)-Sunnyvale
Job Summary:
Alibaba Cloud is responsible for innovative messaging products and is seeking a Site Reliability Engineer to oversee the stability and performance of cloud middleware systems. The role involves managing the lifecycle of containerized middleware on Kubernetes, leading incident responses, and developing automation tools to enhance operational efficiency.

Responsibilities:

• Oversee stability maintenance, performance tuning, and high-availability architecture design for cloud middleware, including messaging middleware (Kafka/RocketMQ).

• Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments.

• Lead the troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.

• Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.

• Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.

• Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.

Qualifications:
Required:

• Over 2 years of experience in distributed systems reliability engineering

• Familiar with high-availability architecture design

• Proficient in at least one of Python, Go, or Java

• Experience with cluster management, message reliability assurance, and performance optimization for Kafka/RocketMQ

• Hands-on experience deploying middleware on Kubernetes

• Ability to convert operations experience into automated solutions

• Familiarity with various message middleware, e.g., Kafka and RocketMQ

• Strong scripting skills in Shell/Python

• Experience with Infrastructure as Code (IaC) tools (Terraform preferred)

Preferred:

• Familiar with core SRE practices (incident review, error budgeting, chaos engineering)

• Experienced in building automated risk control systems

• Hands-on experience deploying middleware on Kubernetes (Helm/Operator preferred)

Company:
Alibaba Cloud develops cloud computing and data management services. It is a sub-organization of Alibaba Group. Founded in 2009, the company is headquartered in Hangzhou, China, and has more than 10,000 employees. The company is currently at a late stage.