Site Reliability Engineer, Cloud Incident Response Job, Central London Area, City of London, England, UK, IT/Tech

Location: City Of London

As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial services and healthcare organizations, from the world's largest companies to small and mid-market firms, rely on SS&C for expertise, scale, and technology.

Job Description

Get To Know Us:

SS&C is leading the way. We continue to look for today's and tomorrow's brightest talent, those who embody a spirit to improve not only their own lives, but also the lives of those around them. From college students to seasoned and experienced professionals, we encourage you to apply. SS&C prides itself on hiring diverse, honest, dynamic individuals who value collaboration, accountability, and innovation, to name a few.

Site Reliability Engineer

Location: London office, hybrid, 2 days per week onsite

About the Role

We're seeking a hands-on Site Reliability Engineer to enhance our production reliability, scalability, and operability. You'll use your expertise across observability, Kubernetes, AWS, and infrastructure as code to investigate issues, implement tactical fixes quickly, and drive strategic improvements that raise availability and reduce toil. This is a hybrid role with two days per week in the office. You'll collaborate closely with engineering, product, and support to design, build, and run robust platforms that meet demanding SLAs/SLOs.

What You'll Do

Keep production healthy:
Monitor, troubleshoot, and resolve incidents across services and infrastructure; reduce MTTR and prevent recurrences through high-quality post-incident actions.
Observability as a first-class practice:
Use Grafana, Datadog, and Splunk (and related tools like Prometheus/OpenTelemetry) to detect anomalies, identify root causes, and create actionable alerts and dashboards.
Run Kubernetes at scale:
Operate and harden Kubernetes (EKS preferred); manage deployments, autoscaling, rollouts/rollbacks, service mesh/ingress, and cluster upgrades.
Build reliable cloud foundations:
Design and operate AWS workloads (networking, IAM, EC2/EKS, RDS/Aurora, S3, CloudWatch, ALB/NLB, VPC, Security Groups) with a security-first mindset.
Automate with IaC:
Codify and continuously improve infrastructure using Terraform (modules, workspaces, remote state, policy as code).
Enable fast, safe delivery:
Partner with teams to enhance CI/CD pipelines (e.g., GitHub Actions/Jenkins/Argo CD), progressive delivery, and change management to lower the change failure rate.
Own reliability metrics:
Define and iterate on SLOs/SLIs/error budgets; champion blameless post-mortems and reliability reviews.
Participate in on-call:
Join a fair, well-documented on-call rota; improve runbooks, automation, and alert quality to make on-call sustainable.
Drive strategic improvements:
Identify systemic issues and deliver durable fixes (architecture, capacity, scaling, caching, resilience patterns, rate limiting, back-pressure, circuit breakers, chaos engineering).

What you will bring

5+ years operating production systems as an SRE, DevOps engineer, or software engineer.
Observability:
Hands-on with Grafana, Datadog, and Splunk for incident investigation, dashboarding, alerting, tracing/logs/metrics correlation, and performance analysis.
Kubernetes:
Strong experience running and troubleshooting workloads (controllers, pods, networking, storage, HPA/VPA, Helm/Kustomize).
AWS:
Solid practical knowledge of core services and best practices for security, cost, and reliability.
Terraform:
Confident with module design, state management, DRY patterns, and CI for IaC.
On-call experience:
Demonstrated participation in a production on-call rota, effective incident communication, and post-incident follow-through.
Scripting & engineering fundamentals:
Proficiency in at least one of Python, Go, or Bash; strong Linux, networking (DNS, TLS, HTTP, TCP), and Git.
Collaboration & communication:
Ability to work cross-functionally, write clear runbooks/RFCs, and influence engineering practices.

Nice-to-Have

EKS internals, cluster autoscaler, managed node groups/Fargate; service mesh (Istio/Linkerd), ingress controllers (Nginx/ALB).
Prometheus, Open…