Observability Architect
Job in Atlanta, Fulton County, Georgia, 30383, USA
Listed on 2026-06-18
Listing for: TechDigital Group
Full Time position
Job specializations:
IT/Tech
Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Systems Engineer
Job Description
We are seeking an experienced Observability Architect to design, implement, and mature enterprise‑wide observability capabilities across hybrid on‑premises and cloud environments. The ideal candidate has deep expertise in log aggregation, metrics, tracing, and application performance monitoring technologies, and can drive automation, standardization, and best‑practice adoption. This role will be a key influencer in shaping the organization’s observability strategy, ensuring end‑to‑end system visibility, performance, and reliability.
Key Responsibilities
- Observability Architecture & Strategy
- Develop and maintain the enterprise observability reference architecture, covering logs, metrics, traces, events, dashboards, and alerts.
- Lead the design and implementation of observability solutions that support hybrid multi‑cloud and on‑premises environments.
- Establish standards, governance, and reusable frameworks for telemetry generation, ingestion, correlation, storage, and visualization.
- Drive continuous improvement of monitoring maturity, integrating data‑driven insights and AI‑based analytics where applicable.
- Log Aggregation & Monitoring Solutions
- Architect and administer large‑scale log aggregation platforms such as Splunk, supporting both on‑prem and cloud deployments.
- Define and automate ingestion pipelines, parsing logic, index strategies, role‑based access, and performance tuning.
- Implement configuration management and infrastructure‑as‑code (IaC) practices for repeatable deployment and scaling of observability tools.
- Application & Network Performance Monitoring
- Deploy, configure, and optimize APM solutions such as AppDynamics, Dynatrace, or equivalent platforms.
- Integrate application tracing, synthetic monitoring, real‑user monitoring (RUM), and business transaction analytics.
- Support and enhance Network Performance Monitoring (NPM) capabilities to ensure end‑to‑end visibility across distributed systems.
- Cloud‑Native & Modern Monitoring
- Leverage cloud‑native monitoring tools across AWS, Azure, or GCP (e.g., CloudWatch, Azure Monitor, GCP Operations Suite).
- Guide teams in instrumenting microservices, serverless functions, containers, and Kubernetes clusters using OpenTelemetry and modern telemetry standards.
- Partner with infrastructure, application, and SRE teams to ensure high availability, resilience, and performance.
- Automation & AI‑Driven Engineering
- Build automated workflows for alert tuning, anomaly detection, dashboards, and telemetry enrichment.
- Explore and integrate AI/ML‑based observability features such as predictive analytics, signal correlation, and automated root‑cause analysis.
- Advocate for automation‑first practices and reduction of operational toil.
Qualifications
- 5+ years of hands‑on experience with enterprise‑scale log aggregation platforms, including architecture, deployment, and administration of tools like Splunk across on‑prem and cloud environments.
- 5+ years of experience using automated configuration management and IaC tools (e.g., Ansible, Terraform, GitOps frameworks).
- 2+ years of experience with APM tools such as AppDynamics or Dynatrace, including end‑to‑end application visibility and performance diagnostics.
- Experience with Network Performance Monitoring tools and methodologies.
- Strong understanding of cloud infrastructure and cloud‑native monitoring technologies (AWS, Azure, GCP).
- Familiarity with OpenTelemetry, distributed tracing, and service mesh observability.
- Expertise in designing dashboards, KPIs, and alerting strategies that align with business SLIs/SLOs.
- Experience collaborating with DevOps, SRE, cloud engineering, and application teams in large enterprises.
- Experience implementing AI/ML‑driven observability capabilities (e.g., anomaly detection, auto‑baselining, correlation engines).
- Knowledge of container ecosystems and orchestration platforms (Kubernetes, AKS/EKS/GKE).
- Experience working with event‑driven architectures and microservices environments.
- Strong scripting or programming skills (Python, PowerShell, Bash, etc.).
- Relevant certifications (e.g., Splunk Architect, Dynatrace Professional, Cloud certifications).
- Excellent communication and stakeholder management skills.
- Ability to lead technical strategy and influence architectural decisions.
- Strong analytical, troubleshooting, and problem‑solving abilities.
- Adaptability and curiosity about new technologies and evolving observability trends.