AWS Cloud Ops SRE Job, New York, New York, USA (IT/Tech)

Location: New York

Job Description

The AWS Cloud Operations / Site Reliability Engineer (SRE) is responsible for delivering secure, reliable, and scalable cloud infrastructure. This role covers Infrastructure as a Service, AWS platform release activities, AMI lifecycle management, patching, infrastructure design documentation, Terraform scripting, and maintaining visibility into the application layer and how it functions in production environments. Experience with Harness for DevOps pipelines is a strong plus.

Key and Must-Have Skills

Terraform IaC (mandatory)
EKS container management (mandatory)
Troubleshooting skills during priority incidents
Foundational skills in Linux and Windows preferred

Required Qualifications

10+ years in SRE, Cloud Ops, or DevOps with heavy AWS experience.
Strong hands‑on experience with AWS compute (EC2, ASG, EKS/ECS, Lambda)
Networking (VPC, Route 53, SG/NACL, ALB/NLB)
Storage (S3, EBS, EFS)
Databases (RDS, Aurora, DynamoDB)
Expertise in AMI pipeline management, image building, and OS-level hardening.
Solid experience with Terraform or CloudFormation for IaC.
Demonstrated ability to troubleshoot AWS and application stack issues end‑to‑end.
AWS Platform Operations & Releases
Own and execute AWS platform release management across environments, including validation, regression checks, and readiness reviews.
Operate and evolve AWS core services: VPC, IAM, KMS, Route 53, networking baselines, proxy layers, and organizational guardrails.
Infrastructure as a Service (IaaS) using Terraform
Build, manage, and scale cloud infrastructure using Terraform as the primary IaC tooling.
Create reusable Terraform modules covering networking, compute, storage, EKS, and security.
Ensure IaC follows best practices: versioned, immutable, peer reviewed, and automated through CI/CD.
Amazon EKS (Kubernetes) Operations
Deploy, manage, and maintain production‑grade AWS EKS clusters, node groups, and cluster add‑ons.
Implement Kubernetes platform standards for security, networking, namespaces, RBAC, and secrets management.
Work closely with application teams to ensure workloads run reliably and securely within EKS.
Optimize cluster scaling, workload scheduling, resource limits, and performance tuning.
AMI Lifecycle & Image Management
Manage the complete AMI lifecycle: creation, CIS hardening, vulnerability scanning, tagging, publishing, and deprecation.
Build automated AMI pipelines using image builders, Packer (if applicable), and validation workflows.
Maintain golden images for EC2 fleets, containers, and hybrid workloads.
VIT (Vulnerability / Integration / Integrity Testing) & Patch Management
Lead the VIT process, including vulnerability assessments, remediation workflows, compliance tracking, and closure.
Own OS-level and image patching using AWS Systems Manager (SSM) Patch Manager and automated maintenance windows.
Generate patch baselines, dashboards, and compliance reports, and ensure measurable SLA adherence.
Observability & Application Layer Insights
Build and maintain an observability stack with CloudWatch, X-Ray, OpenTelemetry, and log analytics.
Establish deep visibility into application behavior, dependencies, performance, and error patterns.
Create “golden signals” dashboards covering latency, traffic, errors, and saturation for both infrastructure and applications.
CI/CD & DevOps Automation
Implement and maintain CI/CD pipelines for infrastructure and application deployments.
Experience with Harness is an added advantage, particularly its workflows, verification steps, and deployment strategies (canary, blue/green).
Integrate Terraform, AMI pipelines, EKS updates, and patch automation into CI/CD systems.
Reliability Engineering & Incident Response
Participate in the on‑call rotation; lead incident triage and root‑cause analysis.
Build automation and runbooks to reduce operational toil.
Drive architectural improvements to increase availability, resilience, and performance.
Documentation & Architecture
Produce high‑quality Infrastructure Design Documents (IDDs), runbooks, DR procedures, release notes, and architectural diagrams.
Conduct operational readiness reviews, capacity planning, and cost‑optimization assessments.

Salary Range

$100,000–$120,000 a year

Qualifications

Bachelor's degree in Computer Science
