SRE Engineer
Listed on 2026-04-28
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, Cybersecurity
Overview
Spatial Front, Inc. (SFI), a two-time USA Today Top Workplaces awardee and Washington Top Workplaces honoree, is seeking an SRE Engineer to support our growing team. The SRE Engineer will support the Infrastructure, Production, and Compliance Support (IPCS) team within the enabling rail of a large-scale federal enterprise program. This role is responsible for improving the reliability, availability, performance, observability, and operational resilience of mission-critical systems supporting a complex, multi-environment ecosystem across development, test, training, and production.
The SRE Engineer will help standardize and mature reliability engineering practices across a highly integrated environment that includes PeopleSoft-based enterprise applications, Oracle platforms, shared services, DevSecOps pipelines, and reporting/integration services operating in regulated NIPRNET and SIPRNET contexts. This position works closely with platform engineers, DevOps, release management, cybersecurity, test automation, and product teams to reduce operational toil, strengthen production readiness, improve incident response, and support continuous delivery without compromising stability or compliance.
As a valued member of the SFI team, you will play a critical role in delivering mission-critical capabilities to our Federal Government customers.
Work Environment
On-site
Key Responsibilities
- Define, implement, and maintain site reliability engineering practices for mission-critical applications and shared services, with emphasis on uptime, resiliency, recoverability, and operational excellence.
- Establish and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services and environments.
- Implement and maintain monitoring, alerting, and observability solutions for production systems.
- Support production and pre-production operations across development, test, training, staging, and production environments.
- Lead incident response activities, conducting root cause analysis and implementing permanent fixes.
- Support capacity planning, performance analysis, trend monitoring, and scalability planning for enterprise platforms and services.
- Create and maintain runbooks, standard operating procedures, incident playbooks, operational dashboards, and knowledge articles.
- Support high availability, disaster recovery, backup/restore validation, and business continuity activities.
- Develop and implement automation to reduce manual operational toil and improve system reliability.
- Contribute to post-deployment validation, smoke testing, rollback readiness, and environment health checks during releases and maintenance windows.
- Collaborate with teams supporting Oracle/PeopleSoft platforms, integration services, reporting services, and shared enterprise tooling to improve reliability end to end.
- Collaborate with development teams to improve system reliability through design reviews and reliability engineering practices.
- Perform capacity planning and performance optimization for production systems.
- Other duties as assigned.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5 years of software engineering experience; 3 years of site reliability engineering, production support engineering, or platform reliability experience for enterprise systems; and 1 year of Unix/Solaris experience.
- Experience supporting enterprise applications in a high-availability, security-conscious, and compliance-driven environment.
- Experience creating operational documentation, runbooks, and incident response procedures.
- Strong troubleshooting skills across application, middleware, integration, and infrastructure layers.
- Strong verbal and written communication skills, including the ability to work across engineering, security, testing, and program stakeholders.
- Demonstrated expertise in site reliability engineering, monitoring, automation, incident response, and performance optimization; experienced with UNIX/Solaris.
- Must be a U.S. Citizen.
- Must possess an active Secret security clearance or be able to obtain one.
- DevOps Engineer or equivalent SRE certification.
- Experience supporting environments subject to RMF, STIG, audit, ATO, or similar compliance requirements.
- Experience with Splunk, enterprise monitoring/observability tooling, or similar operational analytics platforms.
- Experience supporting Oracle-based enterprise environments, including Oracle middleware, Oracle Database, or related platform services.
- Experience supporting PeopleSoft or similarly complex ERP / HCM / payroll platforms.
- Exposure to F5, Oracle Data Guard, Oracle GoldenGate, Kafka, or other enterprise integration/traffic/replication technologies.
- Familiarity with scripting and automation using tools such as Shell, Python, or PowerShell.
- Knowledge of DevOps, testing, and scanning tools, especially within the PeopleSoft environment, such as PHIRE, PFT, Tricentis, Palo Alto, CAST, etc.
- Experience as an SRE supporting DoD or federal agency programs.
- Familiarity with…