Senior Java Site Reliability Engineer Job, McLean area, Virginia, USA, IT/Tech

Role:
Senior Java Site Reliability Engineer

Experience: 16–20 Years

Job Type: Contract

Project:
Hybrid

Location:
McLean, VA

Key Responsibilities

Support and maintain highly available production platforms across cloud and distributed environments.
Drive incident management, root cause analysis, problem management, and platform stability initiatives.
Monitor and maintain uptime of Java applications and microservices.
Proactively identify and resolve application performance bottlenecks.
Conduct root cause analysis (RCA) for application outages and incidents.
Implement resiliency patterns including circuit breakers, retries, and failover mechanisms.
Lead reliability engineering efforts focused on system availability, performance optimization, and operational excellence.
Implement and enhance observability solutions including monitoring, logging, alerting, and incident response automation.
Collaborate with development, infrastructure, and cloud engineering teams to improve deployment reliability and operational efficiency.
Support infrastructure modernization, cloud transformation, and platform automation initiatives.
Coordinate disaster recovery testing, resiliency validation, capacity planning, and production readiness reviews.
Provide technical leadership and mentor offshore and onshore engineering teams.

Required Experience

16–20 years of experience in Site Reliability Engineering (SRE), Production Engineering, Platform Engineering, or Application Support.
Strong experience supporting large-scale enterprise production environments.
Proven background in incident management, problem management, and operational support.
Experience working within banking, financial services, fintech, or other highly regulated industries.
Hands-on experience supporting mission-critical applications with stringent availability and performance requirements.

Required Skills

Java
Kubernetes and Container Platforms
Docker
Cloud Platforms (AWS, Azure, or GCP)
CI/CD Tools (Jenkins, GitHub Actions, GitLab CI/CD, Argo CD)
Infrastructure as Code (Terraform, Ansible)
Monitoring & Observability Tools (Splunk, Datadog, Grafana, Prometheus, Moogsoft)
ServiceNow, Jira, Confluence
Python, Bash, or Shell Scripting
SQL and Database Troubleshooting
Application Performance Monitoring (APM)
Production Release Management
Disaster Recovery and High Availability Architectures

Education

Bachelor's degree in Computer Science, Information Systems, Engineering, or a related technical discipline.
