Reliability Engineer

Job in Bedford, Middlesex County, Massachusetts, 01730, USA
Listing for: Systems Engineering Solutions Corporation
Full Time position
Listed on 2026-05-26
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

This role supports the U.S. Air Force Cloud One Architecture and Common Shared Services contract, which currently has an opening for a Reliability Engineer. The Reliability Engineer is responsible for ensuring the availability, performance, scalability, and resiliency of mission-critical systems. This role applies software engineering principles to infrastructure and operations, with a strong emphasis on automation, monitoring, incident response, and continuous reliability improvement. The Reliability Engineer serves as the bridge between development, operations, and platform teams to ensure production systems consistently meet defined service level objectives (SLOs) while supporting rapid, safe delivery of new capabilities.

Location: This position is hybrid remote. Candidates will be required to work onsite as needed. Candidates located near Hanscom AFB (Boston, MA) are preferred.

System Reliability & Availability
  • Design, implement, and maintain highly available, fault-tolerant systems in cloud and hybrid environments
  • Define, measure, and report Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
  • Identify reliability risks and implement mitigation strategies across the system lifecycle
  • Conduct capacity planning and performance modeling to ensure systems scale to meet demand
Monitoring, Observability & Alerting
  • Implement and manage monitoring, logging, and tracing solutions to provide full system observability
  • Define actionable alerting thresholds that minimize noise and enable rapid incident detection
  • Analyze trends and metrics to proactively identify potential reliability issues
Incident Response & Problem Management
  • Participate in on-call rotations and lead incident response activities for production systems
  • Coordinate troubleshooting efforts across development, infrastructure, and security teams
  • Conduct post-incident reviews (PIRs) and develop corrective and preventive action plans
  • Track recurring issues and ensure root causes are resolved
Automation & Engineering Excellence
  • Automate operational tasks to reduce manual intervention and operational risk
  • Develop scripts, tools, and services that improve system reliability and reduce mean time to recovery (MTTR)
  • Promote "automation over toil" and standardize operational workflows
Reliability-Focused Engineering
  • Participate in architecture and design reviews with an emphasis on reliability, resiliency, and recoverability
  • Validate disaster recovery (DR) and business continuity plans; test failover mechanisms
  • Support chaos engineering, fault injection testing, and resilience validation where appropriate
Collaboration & Governance
  • Partner with DevOps, Platform, and Security teams to ensure reliability aligns with delivery and compliance objectives
  • Document system reliability standards, runbooks, and operational procedures
  • Support compliance and audit activities (e.g., FedRAMP, FISMA, internal operational controls)
Required Skills
  • Bachelor's degree and eight (8) or more years of experience, or Master's degree and six (6) or more years of experience. Additional experience may be accepted in lieu of a degree.
  • Active Secret clearance at a minimum required to start
  • US citizenship required
  • Experience with cloud platforms (AWS, Azure, OCI, or GCP), including managed services
  • Experience with containerized environments (Docker, Kubernetes)
  • Familiarity with CI/CD pipelines and deployment automation
  • Experience with SLOs and error budgets
  • Experience with capacity modeling and performance testing
  • Strong understanding of:
    • Distributed systems and high-availability architectures
    • Linux/Windows system administration
    • Networking fundamentals (DNS, TCP/IP, load balancing)
  • Hands-on experience with:
    • Monitoring and observability tools (e.g., Prometheus, Grafana, ELK/Elastic, Datadog, Azure Monitor)
    • Infrastructure as Code (Terraform, ARM, CloudFormation)
    • Scripting or programming languages (Python, Bash, Go, PowerShell, or similar)
  • Experience supporting incident management and on-call operations
Preferred Skills
  • Experience with USAF Cloud One or Platform 1
  • Experience with Zero Trust Architecture
  • Cloud certifications in AWS, Azure, Google, or Oracle clouds
SES provides a competitive salary and the following benefits:
  • Medical
  • Dental
  • Vision
  • AD&D
  • STD
  • LTD
  • Company-paid life insurance
  • 401(k) with employer contribution
  • Paid Time Off
  • Pet Insurance