Release/Incident Operations Engineer
Listed on 2026-06-12
IT/Tech
Cybersecurity
Job Description
Everforth ECS is seeking a Release/Incident Operations Engineer to work in the National Capital Region covering the Pentagon, Falls Church, and Fairfax. This position is contingent upon contract award.
The War Data Platform (WDP) is a key initiative within the U.S. Department of War’s (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence on the battlefield. The WDP spans Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.
The Release/Incident Operations Engineer coordinates release operations and incident triage support for AI and machine learning model‑serving pipelines across WDP Core Integration’s full multi‑enclave environment, ensuring deployment consistency and operational continuity in direct support of DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
Responsibilities
- Coordinates release operations for artificial intelligence and machine learning model serving across War Data Platform (WDP) Core Integration environments supporting Department of War missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
- Directs change‑window execution, rollback readiness activities, and deployment governance for model‑runtime updates, serving endpoints, and pipeline modifications.
- Conducts incident triage support by analyzing telemetry, reviewing service health indicators, and initiating stabilization actions across Kubernetes clusters, VMware environments, GitLab Continuous Integration pipelines, Prometheus metrics, Grafana dashboards, and Elastic Stack observability tooling.
- Executes root‑cause analysis activities for serving incidents by collecting operational evidence, reconstructing failure sequences, validating remediation steps, and documenting corrective actions aligned with mission assurance requirements.
- Maintains operational readiness for model serving by coordinating with Platform One, Cloud One, multi‑national engineering teams, and cross‑service mission partners to align release activities with enclave‑specific constraints, cross‑domain deployment architectures, and security requirements.
- Produces mission‑critical deliverables including release plans, rollback packages, incident triage reports, root‑cause analysis documentation, operational risk assessments, and service restoration summaries.
- Strengthens program value by advancing deployment consistency, reducing mission risk, and reinforcing operational continuity across all enclaves.
- Supports Tier‑4 incident response actions to maintain service‑level agreements and sustain mission performance for enterprise artificial intelligence model‑serving capabilities.
- Performs other duties as assigned.
Qualifications
- Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI).
- 3 or more years of experience in release engineering, incident operations, or platform support roles within a federal government or classified environment, including demonstrated hands‑on responsibility for change‑window execution, deployment governance, rollback readiness, and incident triage for AI/ML model‑serving pipelines or equivalent enterprise cloud‑hosted services across multi‑enclave or multi‑classification environments.
- Hands‑on experience applying enterprise observability and container orchestration tooling, including Kubernetes, GitLab CI, VMware, Prometheus, Grafana, and Elastic Stack, to diagnose serving failures, analyze pipeline telemetry, execute root‑cause analysis, and coordinate stabilization activities across Unclassified, Secret, and Top Secret network environments.
- Active DoW 8570/8140‑compliant IAT Level II certification, such as CompTIA Security+ CE, CompTIA CySA+, CompTIA Cloud+, Cisco CCNA Security, GIAC GSEC, GIAC GCED, or ISC² SSCP, as required for access to DoW information systems.
- Strong…