Senior Cloud Engineer (AWS / Azure / GCP) - VP, New York, New York, USA, IT/Tech

Position: Senior Cloud Engineer (AWS / Azure / GCP) - VP
Location: New York

Role Summary

We are seeking a Senior Cloud Engineer / Site Reliability Engineer (SRE) to design, build, and operate secure, scalable cloud platforms across AWS, Azure, and GCP. This role is responsible for configuring, deploying, and maintaining virtual machines and containerized applications, using Terraform to automate infrastructure provisioning and lifecycle management. You will provide specialized support for high‑stakes production deployments, lead incident response for technical escalations, and apply SRE principles (SLIs/SLOs, error budgets, automation, and reliability engineering) to improve availability, performance, and operational excellence in a multi‑cloud environment.

Key Responsibilities

Cloud Platform Engineering (AWS / Azure / GCP)

Architect, implement, and maintain cloud infrastructure across AWS, Azure, and GCP using Terraform (IaC).
Design and implement cloud landing zones aligned with best practices:
- Account/subscription/project structure, environment separation, identity boundaries
- Baseline guardrails and policy enforcement (Azure Policy, AWS Organizations/SCPs, GCP Org Policies)
- Centralized audit logging, monitoring, and cost allocation standards
Build and operate cloud‑native virtual network constructs (cloud‑focused only):
- Azure: VNets, subnets, NSGs, route tables, Private Endpoints, hub/spoke patterns.
- AWS: VPCs, subnets, security groups, NACLs, route tables, VPC endpoints/PrivateLink, multi‑account connectivity patterns.
- GCP: VPC networks, subnets, firewall rules, routes, Private Service Connect, Shared VPC patterns.
Implement private‑by‑default service access patterns (private endpoints, controlled egress, service‑to‑service access controls).

Compute, Virtual Machines, and Containers

Configure, deploy, and maintain virtual machines and scalable compute patterns:
- AWS EC2 (Launch Templates, Auto Scaling Groups)
- Azure Virtual Machines / VM Scale Sets
- GCP Compute Engine / Managed Instance Groups
Own OS hardening, baseline configuration, patching strategies, and instance bootstrapping (cloud‑init, image pipelines).
Deploy and operate containerized workloads using Kubernetes:
- EKS / AKS / GKE (cluster design, upgrades, node pools, RBAC, scaling)
- Container registries (ECR / ACR / Artifact Registry) and artifact promotion strategies
Implement workload delivery patterns (Helm/Kustomize), rollout strategies (blue/green, canary), and safe rollbacks.

Infrastructure as Code, Automation & CI/CD (Terraform)

Build reusable, versioned Terraform modules with standards for naming, tagging/labels, and secure defaults.
Implement Terraform best practices: remote state, locking, environment isolation, secrets handling, and drift detection.
Integrate IaC into CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, GitLab CI):
- Automated validation, linting, security scanning, plan/apply workflows, approvals, and promotions
Implement policy‑as‑code guardrails (OPA/Conftest, Sentinel where applicable) to prevent unsafe changes.

SRE: Reliability Engineering, Observability & Operational Excellence

Define, implement, and improve SLIs/SLOs (availability, latency, error rates, saturation) for critical services and platforms.
Manage and enforce error budgets to balance reliability with delivery velocity.
Establish and continuously improve observability standards:
- Metrics, logs, traces, dashboards, and alerting across cloud services and Kubernetes
- Tooling such as CloudWatch, Azure Monitor/Log Analytics, GCP Cloud Monitoring/Logging, OpenTelemetry, Prometheus/Grafana (where used)
Improve incident detection quality by reducing alert noise, implementing actionable alerts, and creating clear escalation paths.
Drive reliability improvements through:
- Capacity planning, performance tuning, load testing support
- Resilience engineering (multi‑zone design, graceful degradation, retries/timeouts, backpressure)
- Continuous automation to eliminate toil (self‑healing, auto‑remediation runbooks, ChatOps where applicable)

Production Support, Incident Response & Escalations

Provide specialized support for high‑stakes production deployments (major releases, platform cutovers, migrations).
Lead incident response: triage, mitigation, recovery,…
