Lead Site Reliability Engineer Job, Washington area, District of Columbia, USA, IT/Tech

About Bridge Defense

Bridge Defense is redefining how modern defense technology is delivered. Based in Washington, D.C., we are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. We provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission‑ready software to meet evolving defense challenges. Our services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting‑edge cyber defense.

We are led by technologists and veterans with firsthand mission experience, which enables us to understand both the operational realities and the innovation needed to succeed. Our approach is agile and outcome‑based, delivering results in weeks rather than months whenever possible.

At Bridge Defense, we value people and integrity. We foster an environment where innovation thrives in support of traditional mission requirements. Our team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on the nation’s most critical projects.

Core Values:

Innovation & Responsiveness:
We push beyond legacy models with efficient, tech‑led solutions built to scale and evolve.
Trusted Performance:
Security, compliance, and deep experience delivering in demanding environments guide all we do.
Mission‑Focused Expertise:
From veteran leadership to cleared engineers, our people understand both the technology and the mission.

About the Role

As the Lead Site Reliability Engineer for our Compute Bridge Engagement, you’ll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high‑performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a 9‑figure government program.

This is a hands‑on engineering leadership role that bridges physical infrastructure and modern DevOps automation. It is ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You’ll Do

Lead infrastructure design, deployment, and operations for Compute Bridge hardware clusters across secure and distributed environments
Install and configure physical systems, including high‑density GPU servers, networking gear, and storage arrays
Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
Operate and maintain distributed networking meshes across multiple classified and unclassified domains
Implement and manage out‑of‑band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
Provide on‑site technical leadership for deployments, troubleshooting, and continuous improvement
Mentor junior engineers and establish operational best practices across the Compute Bridge program as the contract grows

What You’ll Bring

3+ years of experience in site reliability, systems engineering, or hardware operations roles
Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
Strong experience with Linux systems administration, imaging, and automated deployment
Hands‑on experience managing large‑scale clusters or distributed systems in OpenShift or Kubernetes environments
Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
Experience configuring and managing networking and…

