Site Reliability Engineer
Listed on 2026-02-17
IT/Tech
Cloud Computing, Systems Engineer, IT Support, SRE/Site Reliability
This is a contract-to-hire opportunity in the North Metro Atlanta area. This role is not open to C2C, OPT, or any visa consideration. No vendor support of any kind is allowed.
JOB DESCRIPTION
Site Reliability Engineer – Observability
Overview: We are seeking a skilled Site Reliability Engineer III to join our Platform Engineering team, focusing on building and maintaining a comprehensive observability platform. In this role, you will ensure that our microservices, Kubernetes clusters, and cloud infrastructure are consistently reliable, high-performing, and scalable. You will work closely with cross-functional teams to provide deep insights into system health, performance, and availability through metrics, logs, and traces.
This is a key role for those passionate about creating robust, proactive monitoring systems to support troubleshooting and optimization.
Responsibilities:
- Develop and sustain a resilient observability stack using tools such as Prometheus, Grafana, Loki, InfluxDB, Telegraf, OpenTelemetry, and more.
- Collaborate with DevOps, engineering, and product teams to understand monitoring requirements and deliver data-driven insights for better decision-making.
- Design and implement monitoring solutions across diverse environments, including Kubernetes clusters, microservices, AWS, Azure, on-prem vSphere setups, and networks using Windows, Linux, Cisco, Juniper, Arista, and more.
- Aggregate and store logs, metrics, and traces from distributed systems to ensure comprehensive, end-to-end visibility.
- Develop alerting mechanisms based on KPIs and thresholds to support proactive performance monitoring and application uptime.
- Create and maintain dashboards to monitor system health, application performance, and resource utilization.
- Build solutions for monitoring key application metrics, including latency, request rates, error rates, and service dependencies.
- Support incident response efforts, collaborating with DevOps, SRE, and development teams to troubleshoot and resolve performance issues.
- Define and implement automated incident response workflows using observability data.
- Participate in post-incident analyses to identify root causes and continuously improve system reliability.
- Identify areas to improve observability practices, including better instrumentation, alerting, and reporting strategies.
- Document observability setups, best practices, and troubleshooting techniques.
- Stay informed on the latest observability technologies and industry trends to enhance the observability ecosystem.
- Provide regular reports and dashboards on system health and performance metrics to ensure transparency for stakeholders.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or related field (or equivalent practical experience).
- 3–5 years of experience in observability, monitoring, or related areas such as SRE, DevOps, or Platform Engineering.
- Proven experience in building, scaling, and managing observability solutions for complex infrastructure environments (Kubernetes, AWS, Azure, on-prem vSphere, and Windows/Linux).
- Proficiency with Git version control, including branch management, conflict resolution, and GitHub workflows, along with experience in CI/CD using GitHub Actions.
- Familiarity with VMware vSphere, cloud platforms (AWS, GCP, Azure), and containerized environments (Docker and Kubernetes).
- Relevant certifications (e.g., VMware Certified Professional (VCP), AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, Certified Kubernetes Administrator) are a plus.
- Deep understanding of observability principles, including metrics, logs, and traces.
- Strong experience with monitoring tools (Prometheus, Grafana, InfluxDB, Telegraf, etc.) and Kubernetes/containerized workloads.
- Knowledge of cloud-native technologies, Infrastructure as Code (IaC), and DevOps practices.
- Experience with Application Performance Management (APM) tools.
- Proficient in scripting and automation with languages like Python, Bash, or Go.
- Skilled in data visualization and reporting, using tools like Grafana and Kibana.
- Ability to troubleshoot complex issues using logs, metrics, and traces for effective incident response.
- Strong collaboration and communication skills for working with SRE, DevOps, and engineering teams.
- Problem-solving mindset with attention to detail in designing observability solutions.
- Adaptable to a fast-paced, evolving technical environment.
- Eagerness to stay up to date with trends in observability, cloud technologies, and distributed systems.