SRE Support Engineer - Observability Job, Austin area, Texas, USA, IT/Tech

Role Overview

The Observability & Tools Support Engineer provides high-impact technical support for customers of a large technology company’s internal IaaS platform, with a focus on monitoring, alerting, telemetry, and operational tooling.

This role spans a wide range of support, from white-glove onboarding and end-to-end customer enablement to deep technical troubleshooting across Linux, networking, and observability systems (especially Prometheus and Alertmanager). You will also contribute to improving the support function itself: strengthening tooling, documentation, workflows, and feedback loops so the service scales.

Success depends on excellent troubleshooting, strong written communication, comfort working with highly technical customers, and the maturity to identify patterns and drive operational improvements beyond individual ticket resolution.

Business Outcome

Become a trusted frontline expert for the customer’s observability ecosystem and operational tooling by delivering fast, accurate support across Slack and tickets, improving monitoring reliability, and reducing incident impact through better triage, troubleshooting, onboarding, and knowledge capture.

Success Measures

Healthy volume of threads and tickets handled with high-quality outcomes
Consistent achievement of time-based SLAs
High customer satisfaction, as measured through surveys
Accurate classification of issue type, severity, and recurring patterns
Reduced repeat issues through better docs, tooling, and scalable onboarding

What Will Be True When You Succeed

Customers can onboard smoothly to monitoring/alerting with minimal friction
Monitoring and alerting issues are resolved quickly, with fewer escalations
Linux and networking-related incidents reach resolution faster due to strong troubleshooting and clean handoffs
Engineering and SRE teams receive clear, actionable feedback based on real customer trends
Knowledge base content prevents tickets and accelerates self-service

Core Work Units

Frontline Support for Observability & Tooling

Manage Slack threads and tickets (roughly 50/50)
Handle a broad range of customer support, from simple issue resolution to end-to-end onboarding
Provide clear, structured guidance to highly technical customers
Maintain strong attention to detail while managing multiple interactions in parallel

Deep-Dive Troubleshooting & Incident Support

Troubleshoot, isolate, and resolve monitoring and alerting issues (especially Prometheus + Alertmanager)
Troubleshoot complex Linux and networking issues (TCP/IP fundamentals required)
Support OpenTelemetry, tracing, and telemetry pipelines, including investigation of gaps in signals and instrumentation
Drive incidents to resolution in partnership with Engineering/SRE teams

Documentation & Knowledge Development

Build and maintain customer-facing and internal knowledge base articles
Create informational posts for the community support platform
Turn repeated issues into reusable guides, checklists, and onboarding playbooks

Trend Analysis & Feedback to Engineering

Analyze and categorize customer interaction trends
Provide accurate, meaningful feedback to Engineering and SRE orgs to improve product/tooling
Identify “top offenders” and propose practical fixes (tooling, docs, process, product)

Operational Excellence & Continuous Improvement

Participate in post-mortem reviews and drive follow-through on improvements
Contribute meaningfully to team objectives and goals (process, tooling, and service scaling)
Bring creativity and discretion to resolving highly complex issues that call for “outside the box” thinking

High-Quality Work - what top performance looks like

Frontline Support

Moves smoothly from triage to deeper analysis without losing the customer
Communicates clearly and confidently with technical users
Maintains clean follow-ups and thread hygiene even with high context switching

Troubleshooting

Rapidly isolates issues across monitoring/alerting configs, Linux runtime behavior, and network connectivity
Uses structured approaches to incident handling: hypothesis → test → evidence → resolution
Produces high-signal writeups that accelerate downstream resolution

Documentation & Enablement

Documentation is clear enough that customers avoid…

