Site Reliability Engineer, Greater London, England, UK (IT/Tech)

Location: Greater London

Requirements

Master’s degree in Computer Science, Software Engineering, Systems Engineering, Robotics, or equivalent experience
7+ years of experience, with a proven track record in SRE, DevOps, or Systems Engineering with a focus on IoT, remote devices, or distributed edge hardware
Deep proficiency in Linux/Unix systems (Debian/Ubuntu preferred), including kernel tuning, scripting (Python, Bash), and networking protocols (TCP/IP, MQTT, CoAP, HTTPS/REST, DNS)
Knowledge of security best practices for IoT and remote devices, including secure boot, encryption at rest/in transit, and certificate management
Expert proficiency in Python, Rust, or Go, and in configuration management tools (Ansible/Terraform) for fleet-wide deployments
Strong understanding of SRE principles, including SLIs/SLOs, error budgets, and automation over manual "toil"
Experience with enterprise MDM or Unified Endpoint Management (UEM) platforms (such as Jamf Pro, Microsoft Intune, FleetDM, Mosyle, Esper, 42Gears SureMDM, SOTI MobiControl, VMware Workspace ONE, or Headwind MDM)
Experience with open-source device management solutions is a plus (such as FleetDM, Mender.io, Balena, MicroMDM, Memfault, or RAUC)
Experience with building Linux images and containers (with tools such as Yocto, PTXdist, ubuntu-image, Packer, Debian live-build, debootstrap)
Experience with Linux packaging formats (such as deb, snap, flatpak, nixpkg)
Hands-on experience troubleshooting hardware interfaces, specifically USB/Bluetooth barcode scanners and industrial touchscreen displays
Experience configuring and locking down browsers or native apps into dedicated kiosk environments on both Linux and mobile OSs
Hands-on experience with cloud infrastructure (AWS or Azure) and containerization technologies like Docker and Kubernetes
Experience with CI/CD pipelines tailored for edge device deployment
Experience with ROS (Robot Operating System) or managing hardware-in-the-loop systems is a plus
Background in warehouse automation, logistics, or industrial IoT

What the job involves

Locus Robotics is seeking a Site Reliability Engineer (SRE) with a specialized focus on Remote Device Management. As a core member of our reliability team, you will ensure the stability, security, and scalability of the LocusONE platform supporting our growing fleet of Autonomous Mobile Robots (AMRs), peripherals, and reporting devices.
You will bridge the gap between software development and field operations, using Linux expertise and Mobile Device Management (MDM) tools to manage thousands of edge devices globally.
Fleet Management at Scale:
Design, implement, and maintain robust and secure device management strategies for remote devices using Unified Endpoint Management (UEM), MDM solutions, and orchestration tools
Reliability & Monitoring:
Develop and manage observability pipelines to track device health, connectivity, and performance metrics across diverse warehouse environments
OTA & Lifecycle Management:
Own the end-to-end lifecycle of device software, including secure Over-the-Air (OTA) firmware updates, rollback strategies, and OS hardening
Incident Response:
Participate in on-call rotations to troubleshoot complex system failures, performing root cause analysis (RCA) to drive long-term reliability improvements
Self-Healing Infrastructure:
Develop automated remediation scripts that detect and fix common edge issues, such as hung scanning processes or display driver freezes, without manual intervention
Zero-Touch Scalability:
Architect and maintain remote provisioning and management workflows for a global fleet of Linux, iPad, and Android devices using secure remote management strategies
Secure Remote Access:
Implement and manage secure remote access methods such as SSH, VPNs, and private APNs to enable out-of-band troubleshooting and real-time device control without physical site visits
SLO/SLI Frameworks:
Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for device availability, connectivity, and peripheral performance
Error Budget Management:
Use error budgets to balance the pace of innovation with fleet reliability, ensuring data-driven decisions for feature releases versus stability fixes
Security Governance:
Align fleet operations with industry standards such as the NIST Cybersecurity Framework (CSF), ISO/IEC 27001, and CIS Controls
Vulnerability Management:
Drive continuous monitoring and automated patching schedules to mitigate risks and ensure regulatory compliance across all managed device platforms
