
Senior Cloud Engineer (AWS / Azure / GCP) - VP

Job in New York, New York County, New York, 10261, USA
Listing for: Morgan Stanley
Full Time position
Listed on 2026-05-16
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below
Position: Senior Cloud Engineer (AWS / Azure / GCP) - VP
Location: New York

Role Summary

We are seeking a Senior Cloud Engineer / Site Reliability Engineer (SRE) to design, build, and operate secure, scalable cloud platforms across AWS, Azure, and GCP. This role is responsible for configuring, deploying, and maintaining virtual machines and containerized applications, using Terraform to automate infrastructure provisioning and lifecycle management. You will provide specialized support for high‑stakes production deployments, lead incident response for technical escalations, and apply SRE principles (SLIs/SLOs, error budgets, automation, and reliability engineering) to improve availability, performance, and operational excellence in a multi‑cloud environment.

Key Responsibilities

Cloud Platform Engineering (AWS / Azure / GCP)
  • Architect, implement, and maintain cloud infrastructure across AWS, Azure, and GCP using Terraform (IaC).
  • Design and implement cloud landing zones aligned with best practices:
    • Account/subscription/project structure, environment separation, identity boundaries
    • Baseline guardrails and policy enforcement (Azure Policy, AWS Organizations/SCPs, GCP Org Policies)
    • Centralized audit logging, monitoring, and cost allocation standards
  • Build and operate cloud‑native virtual network constructs (cloud‑focused only):
    • Azure: VNETs, subnets, NSGs, route tables, Private Endpoints, hub/spoke patterns.
    • AWS: VPCs, subnets, security groups, NACLs, route tables, VPC endpoints/PrivateLink, multi‑account connectivity patterns.
    • GCP: VPC networks, subnets, firewall rules, routes, Private Service Connect, Shared VPC patterns.
  • Implement private‑by‑default service access patterns (private endpoints, controlled egress, service‑to‑service access controls).
Compute, Virtual Machines, and Containers
  • Configure, deploy, and maintain virtual machines and scalable compute patterns:
    • AWS EC2 (Launch Templates, Auto Scaling Groups)
    • Azure Virtual Machines / VM Scale Sets
    • GCP Compute Engine / Managed Instance Groups
  • Own OS hardening, baseline configuration, patching strategies, and instance bootstrapping (cloud‑init, image pipelines).
  • Deploy and operate containerized workloads using Kubernetes:
    • EKS / AKS / GKE (cluster design, upgrades, node pools, RBAC, scaling)
    • Container registries (ECR / ACR / Artifact Registry) and artifact promotion strategies
  • Implement workload delivery patterns (Helm/Kustomize), rollout strategies (blue/green, canary), and safe rollbacks.
Infrastructure as Code, Automation & CI/CD (Terraform)
  • Build reusable, versioned Terraform modules with standards for naming, tagging/labels, and secure defaults.
  • Implement Terraform best practices: remote state, locking, environment isolation, secrets handling, and drift detection.
  • Integrate IaC into CI/CD pipelines (e.g., GitHub Actions, Azure DevOps, GitLab CI):
    • Automated validation, linting, security scanning, plan/apply workflows, approvals, and promotions
  • Implement policy‑as‑code guardrails (OPA/Conftest, Sentinel where applicable) to prevent unsafe changes.
SRE: Reliability Engineering, Observability & Operational Excellence
  • Define, implement, and improve SLIs/SLOs (availability, latency, error rates, saturation) for critical services and platforms.
  • Manage and enforce error budgets to balance reliability with delivery velocity.
  • Establish and continuously improve observability standards:
    • Metrics, logs, traces, dashboards, and alerting across cloud services and Kubernetes
    • Tooling such as CloudWatch, Azure Monitor/Log Analytics, GCP Cloud Monitoring/Logging, OpenTelemetry, Prometheus/Grafana (where used)
  • Improve incident detection quality by reducing alert noise, implementing actionable alerts, and creating clear escalation paths.
  • Drive reliability improvements through:
    • Capacity planning, performance tuning, load testing support
    • Resilience engineering (multi‑zone design, graceful degradation, retries/timeouts, backpressure)
    • Continuous automation to eliminate toil (self‑healing, auto‑remediation runbooks, ChatOps where applicable)
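
For candidates unfamiliar with the term, the error-budget idea mentioned above can be illustrated with a minimal sketch. It assumes a simple time-based availability SLO over a rolling window; the function names, the 30-day default window, and the example figures are hypothetical and are not taken from this listing or from any specific tooling used by the employer.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed example).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO, 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

In practice a team might slow or pause risky releases once the remaining fraction approaches zero, which is one way reliability is balanced against delivery velocity.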
Production Support, Incident Response & Escalations
  • Provide specialized support for high‑stakes production deployments (major releases, platform cutovers, migrations).
  • Lead incident response: triage, mitigation, recovery,…
Position Requirements
10+ years of work experience