Platform Engineer — Cloud & Infrastructure Automation and Observation
Listed on 2026-02-10
IT/Tech
Systems Engineer, Cloud Computing
Overview
We are seeking a Platform Engineer to join our technology team and play a central role in managing and automating our hybrid cloud and on-premises infrastructure. Working closely with the Technology Director and the Development and IT & Systems teams, you will help drive automation, reliability and operational excellence across the full technology estate.
Our infrastructure operates across a hybrid model spanning multiple cloud providers and on-premises environments, supporting a fast-growing, high-volume e-commerce operation. You will champion Infrastructure as Code, build robust CI/CD and deployment pipelines, establish comprehensive observability, and drive the cultural shift towards modern DevOps practices across the engineering organisation.
- Infrastructure Automation & Management
- Infrastructure as Code:
Define, provision and manage cloud and on-premises infrastructure using IaC tools (CloudFormation, Terraform, Ansible or similar), eliminating manual configuration and ensuring repeatable, version-controlled environments
- Hybrid Cloud Management:
Manage and optimise infrastructure across multiple cloud providers and on-premises environments, ensuring consistent governance, security and cost efficiency across the entire estate
- On-Premises & Local Infrastructure:
Work alongside the IT & Systems team to manage local server infrastructure including Windows Server environments (Domain Controllers, Hyper-V, application and file servers), Linux systems and network security appliances; use IaC tools such as Terraform and Packer to automate the provisioning of local virtual machines and container clusters, ensuring local environments match production standards
- Infrastructure Lifecycle Management:
Oversee server maintenance, security patching, storage provisioning and networking equipment management across both cloud and local infrastructure, ensuring consistent standards regardless of where workloads run
- CI/CD, Deployments & Release Engineering
- CI/CD Pipeline Development:
Design, build and maintain continuous integration and deployment pipelines using GitHub Actions, Cloud Build and related tooling, enabling rapid, reliable releases across all environments
- Controlled Rollouts & Deployment Strategies:
Implement blue-green deployments, canary releases and rolling updates for application, database and infrastructure changes, minimising disruption and enabling safe rollback
- Database Deployments:
Manage and automate database schema migrations and deployments, ensuring zero-downtime releases through controlled rollout strategies
- Runtime Mitigation:
Utilise tooling to patch or isolate vulnerable containers in production without interrupting service, enabling rapid response to security findings
- Build Reliability:
Monitor pipeline health and implement automated alerting for build failures, ensuring the team addresses delivery blockers immediately
- Observability, Monitoring & Alerting
- Full-Stack Observability:
Architect and maintain a comprehensive observability strategy across all systems, consolidating and extending existing monitoring infrastructure (Zabbix, CloudWatch) with modern tooling such as Grafana, Loki, New Relic or Datadog to ensure proactive alerting and full visibility
- Automated Incident Management:
Set up integrations between monitoring tools and Jira Service Management to automatically generate incident tickets when production systems fail or breach performance thresholds, and automate ticket triage, prioritisation and escalation workflows
- Pipeline & Build Alerting:
Configure automation to raise Jira tasks or bugs when critical deployment pipelines fail, ensuring delivery blockers are tracked and resolved promptly
- Visibility & Reporting:
Build dashboards and automated reporting for incident tracking, post-mortem outcomes and system health, providing transparency to engineering leadership
- Security & Vulnerability Management
- Cloud Security Posture:
Maintain and enhance security tooling including GuardDuty, Security Hub, Macie and Inspector; manage secrets, IAM policies and network segmentation to ensure compliance with PCI-DSS and data protection requirements
- DevSecOps Integration:
Integrate application security scanning tools such as Snyk into CI/CD pipelines, shifting security left and embedding vulnerability detection into the development workflow
- Reliability, Cost & Performance
- Reliability Engineering:
Implement SLIs, SLOs and error budgets; design and conduct game days and disaster recovery exercises; lead incident response and blameless post-mortems to continuously improve system resilience
- Capacity Planning:
Proactively manage capacity across non-autoscaling and autoscaling architectures, ensuring readiness for peak trading events (Black Friday, Cyber Monday, seasonal promotions) through load testing and performance benchmarking
- Cost Management:
Monitor and optimise spend across cloud providers and local infrastructure,…