Site Reliability Engineer, Alpharetta area, Georgia, USA (IT/Tech)

At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.

Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.

As a Site Reliability Engineer, you will own the availability, performance, and resilience of our production systems. You will partner closely with engineering, product, and leadership to reduce operational risk, eliminate toil, and ensure our customers’ businesses run without interruption.

This role blends deep technical execution with strong judgment, ownership, and communication.

Responsibilities

Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
Proactively monitor application health and performance across cloud infrastructure (AWS).
Lead incident response, including triage, mitigation, root cause analysis (RCA), and post-incident reviews.
Lead and participate in disaster recovery drills and security incident simulations.
Build and maintain Infrastructure as Code (IaC) using AWS-native tooling.
Collaborate with development teams to improve CI/CD reliability, deployment safety, and rollback strategies.
Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
Reduce operational toil through automation, tooling, and process improvements.
Own and evolve observability systems (metrics, logging, tracing, and alerting).
Champion best practices across security, availability, performance, and incident response.

Core Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or a closely related role.
Strong hands-on experience operating production systems in AWS (EC2, ECS, RDS, IAM, CloudWatch).
Experience implementing Infrastructure as Code (CloudFormation, CDK, or Terraform).
Proficiency in Node.js or Python for automation and operational tooling.
Experience with Docker and container-based deployments (ECS preferred; Kubernetes a plus).
Strong understanding of MySQL operations, backups, and performance monitoring.
Proficiency with Git-based workflows and CI/CD systems.

Nice to Have

Familiarity with frontend frameworks (React, Ember.js) to understand performance implications.
Experience operating customer-facing SaaS systems with uptime and performance SLAs.
Exposure to security incident response and compliance-driven environments (HIPAA awareness is a plus).

Core Competencies

Incident Response:
Calm, methodical, and effective under pressure.
System Design:
Strong intuition for failure modes, scalability, and resiliency.
Automation Mindset:
Relentless focus on reducing manual work and building repeatable processes.
Collaboration:
Clear communicator who partners effectively with engineering and product.
Security Awareness:
Proactive, pragmatic approach to risk and data protection.
Ownership:
Treats reliability as a product feature, not a support function.

Qualifications

3+ years of experience in a Site Reliability, DevOps, or related engineering role.
Proven track record managing and scaling applications in a production AWS environment.
Familiarity with full stack environments, particularly those using Node.js.
Experience maintaining and deploying databases such as MySQL with performance tuning.
Experience with container orchestration (e.g., ECS); Kubernetes is a plus.
Commitment to uptime, performance, and security in fast-moving SaaS environments.
