Reliability Engineer
Listed on 2026-01-27
IT/Tech
IT Support, Cloud Computing, Systems Engineer, Systems Administrator
Why do you need to choose between doing important work and having a fulfilling life?
At Ardent, we have both. Ardent employees are committed to solving our customers’ most difficult problems, and we are committed to the well‑being, personal goals, and professional development of our employees. We are “All In.” We put forth our strongest effort possible to get the mission accomplished and we do it together. We respect the skills and experience you bring to the Ardent team, and we provide a rewarding environment to help you succeed.
We offer highly competitive benefits, professional development opportunities, and an exceptional culture that embraces flexibility, innovation, collaboration, and career growth. A collective service mindset underpins our work, and a shared camaraderie in serving clients, colleagues, and our communities sets us apart. Our full commitment to being "All In" for our employees and our clients is not just our approach; it is our standard.
If this sounds like the perfect fit for you, choose Ardent and make a difference with us.
Ardent is seeking a Reliability Engineer to join our team.
This is an onsite role in Ashburn, VA. Must be open to working 2nd or 3rd shift in a 24/7/365 environment.
Position Description
We are seeking a skilled Reliability Engineer to support our client’s mission by enhancing Production Monitoring and ensuring optimal service delivery for their applications. This role involves proactive issue identification, incident resolution, and system health optimization within a 24x7x365 operational environment. The ideal candidate will lead monitoring solutions, manage ITIL engineers, automate processes, and collaborate across IT and business teams to improve service reliability.
Expertise in AWS environments, root cause analysis, and technical troubleshooting is essential along with strong communication and leadership skills to drive continuous improvement.
Requirements
- Experience in Production Monitoring & Support within a 24x7x365 operational environment.
- Strong expertise in incident management, root cause analysis, and problem resolution for cloud‑based applications.
- Hands‑on experience with Amazon Web Services (AWS) and cloud‑based monitoring tools.
- Proficiency in ITIL processes and managing ITIL engineers for efficient service delivery.
- Ability to build and implement monitoring solutions, automate manual processes, and create alerts to ensure system stability.
- Experience with system health monitoring, performance optimization, and troubleshooting production issues.
- Strong leadership skills to collaborate with IT, business, and infrastructure teams to improve production support processes.
- Effective communication skills to provide incident reports and status updates to leadership and stakeholders.
- Ability to develop and maintain technical documentation and knowledge‑base resources for production support.
- Experience in triaging and resolving production incidents; assessing severity and properly escalating issues.
- Active CBP/BI or Top Secret clearance highly desired.
- Must be open to working 2nd or 3rd shift in a 24/7/365 environment.
- U.S. Citizenship required; willing to undergo a government‑issued background investigation.
Responsibilities
- Proactive and early notification of potential and actual issues impacting service delivery.
- Frequent and succinct communication to PSPD leadership during and after incidents.
- Identification of trends and corrective measures.
- Provide needed metrics to PSPD leadership team.
- Support operations 24x7x365 by providing additional technical support and diagnosis.
- Build monitoring and production support solutions to give customers visibility into our services.
- Triage and resolve production incidents related to the cloud platform; participate in root cause analysis and post‑mortem discussions.
- Lead short‑term and long‑term solutions, automate manual processes, and build alerts to monitor the operation of services.
- Assess initial severity, gather impacts, create tickets, engage support teams, and escalate issues properly as they arrive.
- Participate in the creation and maintenance of technical and knowledge‑base documentation.
- Troubleshoot…