Senior Site Reliability Engineer
Listed on 2026-02-12
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Overview
Demand Bridge (a Fluent Software Group/Valsoft company) operates mission-critical platforms that support core business and customer-facing systems. Our infrastructure runs on Microsoft Azure and Cloud Foundry, supporting production workloads with high availability, security, and compliance requirements. Reliability, automation, and operational excellence are foundational to how we operate. We invest in systems and practices that scale responsibly, reduce risk, and enable engineering teams to ship with confidence.
The Opportunity
Demand Bridge is seeking a Senior Site Reliability Engineer to own the day-to-day reliability, availability, and operational readiness of our cloud platform and Azure-based infrastructure. This role serves as the primary DevOps / SRE owner for platform stability, automation, and compliance-related tooling.
This is a hands-on, high-autonomy role ideal for someone who enjoys troubleshooting across layers, improving systems through automation and documentation, and thoughtfully adopting modern tooling, including AI-assisted operational tools, to improve incident response, observability, and operational efficiency.
You’ll work closely with a junior teammate and coordinate with external vendors, while remaining the lead for systems reliability and operational excellence.
What You’ll Do
- Platform & Reliability Ownership
- Own and operate a production cloud platform running on Microsoft Azure and Cloud Foundry (or comparable platforms)
- Ensure availability, performance, and reliability across infrastructure and platform components
- Serve as the primary escalation point for platform-level incidents
- Incident Response & Operational Excellence
- Lead incident response, root cause analysis, and post-incident remediation
- Use modern monitoring, alerting, and AI-assisted observability tools to improve detection, diagnosis, and resolution of incidents
- Drive continuous improvements to reduce operational risk, after-hours incidents, and manual intervention
- Security, Certificates & Secrets
- Own certificate and secrets lifecycle management, including TLS automation and secure secrets handling (e.g., CredHub, Vault)
- Ensure secure and compliant practices around identity, access, and credential management
- Partner with engineering teams to embed security and reliability best practices into platform workflows
- Automation & Infrastructure
- Automate common operational tasks using Bash and/or PowerShell
- Support and extend infrastructure-as-code using Terraform and/or Bicep
- Improve platform consistency and repeatability through Git-driven, automation-first workflows
- Leverage AI-assisted tooling to support scripting, troubleshooting, and operational documentation
- Compliance & Documentation
- Support PCI and other compliance activities, including technical control implementation, audit support, and remediation tracking
- Maintain clear runbooks, diagrams, and documentation to enable repeatable operations and knowledge transfer
- Partner with internal teams and external auditors to support compliance requirements
- Collaboration & Leadership
- Work closely with application engineers, junior SRE/support staff, and vendor partners
- Provide technical guidance and mentorship to junior teammates
- Act as a trusted partner to engineering teams on reliability, performance, and operational readiness
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles supporting production environments
- Hands-on experience with Cloud Foundry, Kubernetes, or Docker in production (Cloud Foundry preferred)
- Strong experience with Microsoft Azure, including networking, compute, IAM, and monitoring
- Strong Linux systems administration experience (RHEL preferred); comfort with Windows Server environments
- Proficiency in PowerShell and/or Bash scripting
- Solid understanding of TLS/PKI workflows, including certificate management and rotation
- Proven experience managing incidents end-to-end and performing root cause analysis
- Strong written communication skills and a disciplined approach to documentation
- Experience using modern automation, observability, or AI-enabled operational tools to improve reliability and efficiency
- Preferred (Nice To…