Lead CloudOps Engineer
Listed on 2026-02-12
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
We are looking for a hands‑on Lead CloudOps Engineer to oversee the reliability, scalability, automation, and day‑to‑day operations of our GCP‑based cloud platform. You will drive infrastructure automation, improve developer workflows, enhance observability, and ensure secure, stable platform operations.
While GCP is the primary environment, the role includes operational responsibility for an existing AWS enterprise environment, requiring the ability to troubleshoot issues, maintain existing systems, and support partner teams without owning major AWS architectural redesigns.
This position is ideal for someone who thrives in cloud‑native environments, enjoys automation, and balances engineering rigor with operational excellence.
This role is a founding position on the CloudOps team in the US, with the potential to grow into future leadership and management positions.
Responsibilities:
GCP Platform Operations & Engineering
- Lead day‑to‑day monitoring and management of GCP infrastructure, focusing on reliability, uptime, security, performance, and compliance
- Manage GKE clusters, including cluster lifecycle, node pools, workload deployment, and operational best practices
- Implement and maintain GCP networking: VPCs, firewall rules, service networking, and private connectivity
- Support data and application teams using GCP services such as BigQuery, Cloud SQL, Pub/Sub, Cloud Storage, and Cloud Run
- Own and maintain Terraform configurations for GCP and AWS using reusable modules, remote state, policy checks, and automation pipelines
- Automate environment provisioning, scaling, and configuration with CI/CD tools such as Cloud Build, GitHub Actions, ArgoCD, or Jenkins
- Build tooling and workflows that improve developer productivity (automated builds, deployments, secrets management, ephemeral environments)
- Build and enhance observability stacks using Cloud Monitoring, Prometheus/Grafana, ELK/Elastic, or OpenTelemetry
- Lead incident response, troubleshooting, RCA generation, and post‑incident improvement efforts
- Define and manage SLOs, error budgets, and operational runbooks
- Ensure secure configurations across cloud services, Kubernetes workloads, secrets storage, and network boundaries
- Implement guardrails and compliance automation using IAM best practices, GCP Organization Policies, and Terraform checks
- Work with security and compliance teams to meet HIPAA, HITRUST, SOC 2, or internal audit requirements
- Maintain stability of a pre‑existing AWS environment by performing tasks such as:
  - Reviewing IAM roles and security posture
  - Supporting workloads on EC2, ECS, EKS, RDS, S3
  - Troubleshooting infrastructure or networking issues
  - Managing configurations, upgrades, and patching
- Assist teams that rely on AWS‑hosted systems and ensure smooth integration with GCP‑centric operations
- Make small‑to‑medium improvements or automation updates for AWS infrastructure using Terraform or CI/CD workflows
- Mentor DevOps, CloudOps, and Platform Engineers through pair programming, reviews, and best‑practice sharing
- Partner with development, data, and security teams to build highly reliable, cloud‑native applications and pipelines
- Establish operational standards, documentation, and playbooks for cloud operations
Requirements:
- 8 years of DevOps, CloudOps, or platform engineering experience
- Deep hands‑on experience with GCP, including: GKE, Workload Identity, cluster networking, VPC design, firewalls, load balancers, BigQuery, Pub/Sub, Cloud SQL, Cloud Storage, Cloud Run, Cloud Functions, IAM, KMS, Secret Manager
- Strong expertise with Terraform, including modules, workspaces, and governance patterns
- Strong CI/CD experience with Git‑based workflows and pipeline automation
- Solid understanding of Linux, networking basics, containerization, and distributed systems
- Experience supporting production workloads in a regulated environment (HIPAA, HITRUST, SOC 2 or similar)
- Practical experience supporting AWS…