Systems Engineer - Site Reliability Engineering Job, Bethesda area, Maryland, USA, IT/Tech

JOB SUMMARY:

The Systems Engineer - Site Reliability Engineering (SRE) is responsible for the reliability, scalability, and performance of mission-critical cloud and on-prem services that support millions of Marriott customers globally. This role involves overseeing incident management, driving automation efforts, and working closely with cross-functional teams to ensure alignment between SRE strategy and business objectives. The role partners closely with Product teams, Applications teams, Infrastructure, and the broader Applications and Infrastructure Delivery teams to develop key metrics and KPIs that improve application stability, availability, and performance.

The ideal candidate will bring strong communication skills, collaborating with key stakeholders across the company to optimize cloud infrastructure and uphold the highest standards of operational excellence in a dynamic, fast-paced environment.

CANDIDATE PROFILE:

Required:

* Undergraduate degree in an engineering or computer science discipline and/or equivalent experience/certification

* 5+ years of hands-on experience in designing, building, and operating production-grade systems, including:

* 2+ years of experience as a Site Reliability Engineer (SRE), building and managing highly available, mission-critical systems

* Deep understanding of SRE practices, such as Service Level Objectives, Error Budgets, Toil Management, Observability & Monitoring, Blameless Postmortems, Incident Response Process, Capacity Planning

* Expertise in AWS services including designing highly available, multi-AZ and multi-region architectures, for example:

* Compute: EC2, Auto Scaling, Lambda

* Containers: EKS (Mandatory), ECS (good to have)

* Networking: VPC, subnets, route tables, NAT gateways, Transit Gateway

* Security: IAM roles/policies, KMS, Secrets Manager

* Storage and Databases: S3, EBS, EFS, RDS, DocumentDB

* Proven automation and programming experience in one or more of the following languages: Python, PowerShell

* Experience using modern, continuous development techniques and pipelines (e.g. Agile, Kanban, Jira, CI/CD, Helm, Harness, Jenkins, Git, Artifactory, Vault)

* Experience designing and implementing end-to-end observability solutions across metrics, logs, and traces using tools like Prometheus, Grafana, ELK Stack, and OpenTelemetry.

* Hands-on experience with Linux administration (RHEL, Ubuntu, CentOS, Amazon Linux)

* Experience troubleshooting API-related issues in distributed systems, including latency, authentication/authorization failures, rate limiting, and upstream/downstream dependency failures.

* Experience with container orchestration engines such as Kubernetes (EKS, AKS, ACK)

* Familiarity with service mesh technologies to enable secure and resilient service communication, including mTLS, traffic shaping, and policy enforcement.

* Familiarity with Infrastructure as Code (IaC) tools like Terraform and CloudFormation.

* Familiarity with configuration management and automation tools such as Ansible.

* Familiarity with vulnerability management, OS hardening, patching, and security compliance of infrastructure, applications, and databases

* Understanding of networking fundamentals

Preferred:

* Experience driving cloud cost optimization initiatives (rightsizing, reserved instances, autoscaling strategies, cost observability)

* Networking expertise including Load Balancing, Firewalls, Security Groups, NACLs, TCP/IP, DNS, HTTP/HTTPS, SSL/TLS, etc.

CORE WORK ACTIVITIES:

* Ensure the reliability, availability, and performance of mission-critical cloud services, implementing best practices for monitoring, alerting, and incident management.

* Oversee the management of high-severity incidents, driving quick resolution and post-incident analysis to identify root causes and prevent recurrence.

* Drive the automation of operational processes and ensure systems can scale effectively to support growing user demand, optimizing cloud and on-prem infrastructure and resource usage.

* Develop and execute the SRE strategy aligned with business goals, and communicate service health, reliability, and performance metrics to senior leadership and stakeholders.

Drive Applications Performance Management and Monitoring:

* Assess application architectures to identify key monitoring points

* Identify Key Performance Indicators, apply monitoring, and report out on compliance.

* Gather information to develop reporting metrics and KPIs

* Ensure that all applications adhere to appropriate monitoring standards based on their technology/business process

* Determine forums and cadence to provide regular monitoring updates

Building Successful Relationships:

* Collaborates with Enterprise Application and Architecture and Infrastructure teams to continuously improve processes and procedures.

* Liaises with vendors and Service Providers to select services and tools that best meet company goals

Managing Projects and Priorities:

* Develops specific goals and plans to prioritize, organize, and accomplish work.

* Champions leaders'…