Cloud Systems Engineer
Listed on 2026-06-06
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Overview
This is a hybrid role - 2 days remote and 3 days in the Malvern, PA office.
We are seeking a highly skilled Site Reliability & Cloud Systems Engineer to design, build, and operate scalable, secure, and highly automated cloud platforms in AWS. This role combines hands‑on reliability engineering with cloud architecture and automation expertise, with a strong emphasis on building immutable infrastructure and improving system resilience.
You will play a key role in evolving our AWS ecosystem into a “push‑button” platform by reducing manual operations, embedding security into every layer, and ensuring production systems are observable, performant, and self‑healing. This role is well‑suited for a proactive engineer who excels at the intersection of infrastructure, automation, and system reliability, blending responsibilities across SRE, DevOps, and Cloud Engineering.
Responsibilities
Reliability, Performance & Operations
- Ensure uptime, reliability, and performance of AWS‑hosted, Linux‑based (Ubuntu) production systems and associated lower environments
- Build and optimize observability using tools like Datadog, CloudWatch, Prometheus/Grafana, and PagerDuty
- Work closely with development teams to diagnose site issues, mitigate impact, and restore system reliability while communicating clearly with stakeholders
- Lead incident response, root cause analysis, and post‑incident reviews
- Participate in on‑call rotations and support 24/7 production environments
- Architect and implement fully automated, ephemeral, and immutable AWS production and lower environments
- Design scalable, resilient distributed systems using AWS best practices
- Eliminate manual processes through Infrastructure as Code (Terraform, Ansible, Packer)
- Build and maintain CI/CD and GitOps workflows (Jenkins, GitHub Actions, GitLab CI, ArgoCD/Flux)
- Develop automation and tooling using Python and Bash to reduce operational toil
- Deploy and manage AWS services including EKS, ECS, Fargate, Lambda, RDS (Aurora PostgreSQL), OpenSearch, Redis, and ElastiCache
- Design and manage networking components such as Transit Gateways, load balancers, and service meshes
- Implement caching, microservices, and distributed system design patterns
- Architect and implement zero‑trust security models using IAM, SCPs, and OIDC
- Embed security into CI/CD pipelines using SAST/DAST tools (e.g., Snyk)
- Ensure compliance through automated auditing, backup strategies, and governance controls
- Partner with development, security, and operations teams to build reliable, observable platforms
- Document systems, runbooks, and operational procedures
- Drive FinOps initiatives for cost optimization and forecasting
- Integrate infrastructure changes into ITIL‑compliant workflows (e.g., Freshservice)
- Influence architectural decisions and promote engineering best practices across teams
- 6–10+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering roles
- Deep hands‑on expertise with AWS services and cloud architecture
- Strong Linux systems engineering experience (Ubuntu preferred)
- Proven experience with Infrastructure as Code (Terraform, Ansible, etc.)
- Experience building and maintaining CI/CD pipelines
- Proficiency in scripting/programming (Python, Bash)
- Hands‑on experience with monitoring and observability platforms
- Solid understanding of cloud security principles (IAM, KMS, Secrets Management, Ansible Vault, HashiCorp Vault)
- Bachelor’s degree or equivalent practical experience
- Experience with containerization and orchestration (Docker, Kubernetes, EKS/ECS)
- Familiarity with GitOps tools such as ArgoCD or Flux
- Experience with SAST/DAST tools and secure SDLC practices
- Knowledge of distributed systems, caching, and microservices architectures
- Experience with FinOps and cost optimization strategies
- Exposure to ITIL processes and service management platforms