Senior Site Reliability Engineer - APAC (Singapore, IT/Tech)

Who are Tyk, and what do we do?

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, and media industries (to name just a few!).

Founded in 2015, with offices in London (UK), London (Ontario), Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell and T-Mobile to RBS, Capital One and Vinci. We have a varied user base hailing from every continent – even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. We believe this principle of flexibility and autonomy unlocks the best performance and enables us to build the best possible team: location and working hours are no barrier.

The role

At Tyk, we’re obsessed with building software that solves problems. Our Site Reliability Engineers (SREs) empower users to pursue their missions with a rich feature set, high availability, and stellar performance. Our customer base is growing, so we’re seeking an experienced Senior SRE to optimize, automate, and improve performance using insights from massive‑scale data in real time. We want an original thinker, a challenger, a technical legend, and an opinionated collaborator who wants to make things better.

Requirements

Lead hands‑on maintenance and optimization of our global Cloud platform within the SLAs, SLIs, and SLOs you'll help define
Collaborate to shape SRE strategy, then translate it into actionable technical plans coordinated through Scrum
Identify reliability issues, drive root cause analysis, and implement solutions alongside your squad
Lead performance tuning and fault finding through analysis of OS and application metrics
Design and implement automation for common operational tasks and cloud‑operations workflows
Develop proactive alerting, monitoring roadmap, and relevant dashboards; define and track KPIs
Participate in on‑call rotation, ensuring effective incident response and resolution within SLAs
Conduct blame‑free post‑mortems, document findings, and maintain operational runbooks
Drive multi‑region and multi‑cloud platform expansion with focus on scalability and automation
Optimize infrastructure performance and cost efficiency without impacting service delivery
Engage with commercial teams on growth plans and translate them into technical SRE strategies
Coordinate penetration testing through provider liaison, technical setup, and environment configuration
Champion continuous improvement across processes, communication, and team practices
Model excellence in software design and knowledge sharing
Plan and execute software upgrades to enhance cloud services

Experience required

Experience in an SRE role
Strong knowledge of cloud technologies and SLA, SLO, and SLI management
Excellent communication and leadership skills
Ability to analyze and improve operational processes and performance metrics
Experience in software design, automation, and root cause analysis
On‑call support experience and customer‑focused mindset
Collaborative attitude with commercial and technical teams
Launching and operating production Kubernetes clusters
Designing and operating infrastructure on AWS and other providers
Operating MongoDB (or other document database) clusters
Operating Redis (or other key‑value storage) clusters
Administering Linux servers
Operating Prometheus and Grafana
Operating log collection and analysis systems
Participating in the on‑call rotation (04:00 to 16:00 UTC)

Skills

Kubernetes (administrator)
Go and/or Python (advanced)
AWS/EKS (advanced)
Linux (advanced)
Terraform and IaC in general (proficient)
Helm (proficient)
MongoDB (or similar)
Redis (or similar)
Monitoring – Prometheus, Grafana, Thanos (familiar)
Grasp of networking concepts (subnets, routing,…

