Sys/Cloud Admin/Incident Response Engineer

Job in Millersville, Anne Arundel County, Maryland, 21108, USA
Listing for: i4DM
Full Time position
Listed on 2026-06-26
Job specializations:
  • IT/Tech
    IT Support, Systems Administrator, SRE/Site Reliability, Cybersecurity
Salary/Wage Range or Industry Benchmark: 85,000 - 110,000 USD per year
Job Description

About Our Team

Our employees thrive in a culture that is fast‑paced, collaborative, and ego‑free, where innovation and teamwork are encouraged at every level. We provide Federal agencies with immediate access to highly skilled professionals who understand complex mission challenges and deliver efficient, scalable solutions. By continuously investing in talent, technology, and specialized capabilities, we maintain expert teams prepared to support evolving Federal missions through tailored technical solutions and modern service delivery approaches.

Description

We value diverse perspectives and strive to attract talent from all backgrounds. We are seeking professionals who are passionate about technology, mission success, and solving complex operational challenges with creativity and purpose. If you enjoy expanding your technical expertise while supporting impactful Federal initiatives, you will thrive within our organization. Veterans and military spouses are strongly encouraged to apply and bring their valuable experience to our team.

About the Role

We are seeking an experienced and highly motivated Sys/Cloud Admin/Incident Response Engineer to support enterprise monitoring operations, incident detection, response activities, and operational situational awareness for a mission‑critical platform within the Department of Veterans Affairs (VA) environment.
In this role, you will provide hands‑on administration and operational support to help ensure monitoring and incident management processes effectively sustain system reliability, operational continuity, and rapid restoration of services across a large‑scale, 24x7 enterprise healthcare platform.

Monitoring, Administration & Operational Support
  • Administer, monitor, and support cloud and platform services, virtual infrastructure, and hosted applications to maintain system health, availability, and performance.
  • Configure, tune, and maintain monitoring, logging, and alerting solutions to improve visibility across infrastructure, applications, and service dependencies.
  • Validate alert accuracy, reduce noise, and help ensure operational issues are detected proactively through effective observability practices.
  • Perform routine system administration tasks such as environment checks, service restarts, access support, patch coordination, and operational maintenance activities.
Incident Response & Service Restoration
  • Monitor incident queues and system alerts, perform initial triage, document impact, and execute defined escalation procedures for incidents affecting mission‑critical services.
  • Participate in major incident response activities, including troubleshooting, log review, coordination with engineering teams, and support for service restoration efforts.
  • Follow incident response playbooks, severity models, and communication protocols to support timely resolution and accurate status reporting.
  • Document incident timelines, actions taken, recovery steps, and supporting evidence to enable post‑incident review and continuous improvement.
Operational Coordination & Stakeholder Support
  • Support coordination during operational events by working across infrastructure, application, DevSecOps, SRE, and service management teams.
  • Provide clear, timely updates on incident status, service impact, troubleshooting progress, and recovery actions to internal stakeholders.
  • Escalate issues appropriately based on impact, urgency, and established operational procedures.
  • Maintain accurate operational records in ticketing, incident, and knowledge management systems.
Observability, Automation & Continuous Improvement
  • Partner with engineers and platform teams to improve dashboards, alerts, runbooks, and operational procedures supporting reliable service delivery.
  • Identify recurring operational issues, alert gaps, and system weaknesses, and recommend practical improvements to reduce incident frequency and response time.
  • Support automation efforts for routine operational tasks, alert correlation, remediation workflows, and incident response activities where applicable.
  • Contribute to post‑incident reviews, root cause analysis activities, and implementation of corrective or preventive actions.
Reporting,…