AI/ML Observability Engineer
Job in Coppell, Dallas County, Texas, 75019, USA
Listed on 2026-05-19
Listing for: Dexian
Full Time position
Job specializations:
- Software Development
  AI Engineer, Cloud Engineer - Software, Machine Learning/ML Engineer, DevOps
Job Description & How to Apply Below
Employment Type: Contract-to-Hire
Role Overview
We are seeking a hands‑on AI/ML Observability Engineer to design and build intelligent monitoring solutions that enhance system reliability and performance.
This role focuses heavily on observability engineering and creating anomaly detection models from scratch. You will develop AI/ML‑driven capabilities to detect, diagnose, and prevent issues across distributed systems, enabling proactive and automated operations.
You will work at the intersection of machine learning, observability platforms, and automation, helping transform traditional monitoring into intelligent, self‑improving systems.
Key Responsibilities
- Design, build, and deploy custom anomaly detection models from the ground up using telemetry data (logs, metrics, traces)
- Develop baselining, event correlation, and predictive analytics models to identify abnormal system behavior
- Enhance enterprise observability platforms by integrating AI/ML‑driven insights and intelligent alerting
- Build solutions that enable early detection of issues and proactive system resiliency
- Implement OpenTelemetry‑based pipelines for collecting and analyzing telemetry across distributed systems
- Create real‑time and batch data pipelines to support ML‑driven observability use cases
- Develop AI‑powered alerting and root cause analysis (RCA) capabilities
- Build services/APIs in Python for model inference and operational integration
- Partner with SRE, platform, and engineering teams to improve monitoring, diagnostics, and incident response
- Contribute to observability best practices including SLOs, SLIs, and Golden Signals
- Drive automation and intelligent workflows to improve incident detection and resolution times
Qualifications
- Strong experience with Python and ML libraries (NumPy, Pandas, scikit‑learn, TensorFlow or PyTorch)
- Proven experience building anomaly detection models from scratch (time series, statistical, or ML‑based approaches)
- Solid understanding of statistics, time series analysis, and pattern recognition
- Experience deploying ML models in production environments (real‑time and batch)
- Strong hands‑on experience with observability concepts, including:
  - Metrics, logs, traces, spans
  - Baselining and anomaly detection
  - Event correlation
- Experience with tools such as:
  - Grafana
  - Dynatrace
- Experience implementing or working with OpenTelemetry
- Experience building data pipelines for telemetry ingestion and processing
- Familiarity with Snowflake, AWS, or similar cloud platforms
- Experience working with distributed systems and microservices environments
- Experience building APIs or services for ML model integration
- Exposure to automation, CI/CD, and infrastructure workflows
- Ability to integrate ML outputs into alerting and operational systems
Preferred Qualifications
- Experience with AI‑driven observability or AIOps platforms
- Exposure to Generative AI / LLMs (RAG, prompt engineering, etc.)
- Experience building:
  - Self‑healing systems
  - Automated remediation workflows
  - AI‑driven alerting and RCA solutions
- Experience working with large‑scale telemetry data