Cloud SRE Lead – Major Incident and Digital Transformation

Job in Reston, Fairfax County, Virginia, 22090, USA
Listing for: Tandym Tech
Full Time position
Listed on 2026-01-01
Job specializations:
  • IT/Tech
    Cloud Computing, IT Project Manager
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below

A recognized services company is actively seeking a new Cloud SRE Lead to join their team. In this role, the Cloud SRE Lead will be responsible for ensuring the reliability, scalability, and performance of the company’s cloud infrastructure on Amazon Web Services (AWS) and for guiding the daily activities of the SRE team.

About the Opportunity:
  • Must be able to obtain and maintain the required agency clearance (6C Public Trust)

Responsibilities:
  • Run ideation sessions across multiple teams and companies to identify areas for improvement and ideas that could radically change the current incident management process

  • Review currently available tools and industry best-of-breed options to recommend and champion the right tools, technologies, and capabilities to empower, visualize, communicate, and activate cross-functional teams

  • Coordinate and lead Major Incidents by directing the troubleshooting, communicating status, encouraging action, guiding the use of tools, and ensuring swift and complete resolution of each Major Incident

  • Schedule and lead blameless postmortems that encourage independent ideas, identification of true root causes, and communication of findings

  • Design, implement, and manage infrastructure as code (IaC) solutions using tools like AWS CloudFormation, Terraform, or Helm Charts to automate deployment and scaling processes

  • Implement robust monitoring and alerting systems to proactively identify and address potential issues before they impact system performance

  • Conduct performance analysis and optimization of AWS infrastructure components to enhance system efficiency and reduce latency

  • Participate in on-call rotations to respond to and resolve incidents promptly

  • Work closely with security teams to implement and enforce best practices for securing AWS environments

  • Facilitate clear communication across teams, providing updates on release status, known issues, and any potential impact on stakeholders

  • Collaborate with development, QA, and operations teams to plan and coordinate software releases

  • Develop and maintain automated deployment pipelines using industry-standard tools such as AWS CI/CD, GitLab CI/CD, Jenkins, or similar

Qualifications:
  • 5+ years of related work experience

  • Bachelor’s Degree

  • Proven experience as a Site Reliability Engineer or similar role

  • In-depth knowledge of AWS services and expertise in managing cloud infrastructure

  • Proven experience in a Digital Transformation role

  • Advanced-level programming and/or scripting in 3 or more of the following languages:
    Python, Java, Chef, Helm, Playwright, Bash, JavaScript, Terraform

  • Strong understanding of DevOps principles and continuous integration/continuous deployment (CI/CD) pipelines

  • Proficiency in CI/CD tools such as AWS CI/CD, GitLab CI/CD, or others

  • Familiarity with infrastructure as code (IaC) tools like CloudFormation, Terraform, Helm Charts, Morpheus, or similar technologies

  • Hands-on experience with version control systems (GitLab, AWS CodeCommit, SVN) and branching strategies

  • Experience with containerization and orchestration tools (e.g., Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Docker, Kubernetes)

  • Familiarity with monitoring tools (e.g., CloudWatch, Prometheus, Grafana, Datadog, Dynatrace) and log analysis

  • Solid understanding of Agile methodologies and their application in release management and Cloud operations

  • Excellent problem-solving and troubleshooting skills

  • Strong communication and collaboration skills

Desired Skills:
  • 3+ years in an SRE or Platform Engineering group supporting high-availability or critical platforms and applications

  • Relevant certifications in DevOps or related fields

  • High Risk Public Trust or Secret Clearance

  • Experience managing a distributed container platform, including but not limited to deployment/release management, provisioning, capacity management, and workload management
