Site Reliability Engineer
Listed on 2026-02-16
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.
Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.
As a Site Reliability Engineer, you will own the availability, performance, and resilience of our production systems. You will partner closely with engineering, product, and leadership to reduce operational risk, eliminate toil, and ensure our customers’ businesses run without interruption.
This role blends deep technical execution with strong judgment, ownership, and communication.
Responsibilities
- Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
- Proactively monitor application health and performance across cloud infrastructure (AWS).
- Lead incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews.
- Lead and participate in disaster recovery drills and security incident simulations.
- Build and maintain Infrastructure as Code (IaC) using AWS-native tooling.
- Collaborate with development teams to improve CI/CD reliability, deployment safety, and rollback strategies.
- Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
- Reduce operational toil through automation, tooling, and process improvements.
- Own and evolve observability systems (metrics, logging, tracing, and alerting).
- Champion best practices across security, availability, performance, and incident response.
- 3+ years of experience in Site Reliability Engineering, DevOps, or a closely related role.
- Strong hands-on experience operating production systems in AWS (EC2, ECS, RDS, IAM, CloudWatch).
- Experience implementing Infrastructure as Code (CloudFormation, CDK, or Terraform).
- Proficiency in Node.js or Python for automation and operational tooling.
- Experience with Docker and container-based deployments (ECS preferred; Kubernetes a plus).
- Strong understanding of MySQL operations, backups, and performance monitoring.
- Proficiency with Git-based workflows and CI/CD systems.
- Familiarity with frontend frameworks (React, Ember.js) to understand performance implications.
- Experience operating customer-facing SaaS systems with uptime and performance SLAs.
- Exposure to security incident response and compliance-driven environments (HIPAA awareness is a plus).
- Incident Response: Calm, methodical, and effective under pressure.
- System Design: Strong intuition for failure modes, scalability, and resiliency.
- Automation Mindset: Relentless focus on reducing manual work and repeatable processes.
- Collaboration: Clear communicator who partners effectively with engineering and product.
- Security Awareness: Proactive, pragmatic approach to risk and data protection.
- Ownership: Treats reliability as a product feature, not a support function.
- 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
- Proven track record managing and scaling applications in a production AWS environment.
- Familiarity with full stack environments, particularly those using Node.js.
- Experience maintaining and deploying databases such as MySQL with performance tuning.
- Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
- Commitment to uptime, performance, and security in fast-moving SaaS environments.