ML Operations & Customer Support Engineer/Senior level KSA
Listed on 2026-06-01
IT/Tech
Systems Engineer, Cloud Computing
About Us
Qualcomm is enabling a world where everyone and everything can be intelligently connected. You interact with products and technologies made possible by Qualcomm every day, including intelligent edge devices, next-generation computing platforms, and advanced AI solutions. Qualcomm’s leadership in AI, high‑performance compute, and connectivity is driving innovation across cloud, edge, and data center environments – delivering scalable, power‑efficient platforms that power the next generation of intelligent infrastructure.
About the Role
Qualcomm is seeking a Machine Learning Operations & Customer Support Engineer within the Customer Engineering team to support strategic customers deploying AI inference workloads on advanced Qualcomm AI inference accelerators. These accelerators utilize Qualcomm's expertise in hardware-accelerated AI to deliver high-performance, energy-efficient generative AI and computer vision inference solutions for modern data centers. This is a customer‑facing, production‑critical role focused on ensuring maximum system uptime, reliability, and performance, while resolving customer support cases within defined SLAs/KPIs.
The role requires deep expertise across ML inference pipelines, systems troubleshooting, and data center operations, working closely with customers, internal engineering, and product teams.
Key Responsibilities
- Act as the primary technical escalation point for customer issues related to AI inference workloads
- Own end‑to‑end case management, ensuring resolution within agreed SLAs and KPIs
- Drive incident response, triage, and root cause analysis (RCA)
- Provide timely and transparent communication to customers on issue status and resolution
- Maintain high levels of customer satisfaction and service reliability
- Ensure high availability and uptime of customer AI deployments (rack‑scale systems)
- Monitor system health, performance metrics, and workload behavior
- Implement and manage failover, redundancy, and resiliency mechanisms
- Proactively identify risks and implement preventative actions
- Support deployment, optimization, and troubleshooting of ML inference pipelines
- Debug issues across model, runtime, system, and hardware layers
- Analyze model performance (latency, throughput, accuracy trade‑offs) in production
- Support frameworks such as PyTorch, TensorFlow, and ONNX, along with model conversion flows
- Assist in model optimization techniques (quantization, batching, compilation, runtime tuning)
- Support bare‑metal and virtualized environments for AI workloads
- Troubleshoot issues across Linux OS, drivers, firmware, and networking stack
- Support deployment and maintenance using Infrastructure as Code (IaC) and automation tools
- Work with DCIM tools and monitoring systems for infrastructure visibility
- Coordinate with hardware vendors for accelerator, server, and networking issues
- Implement and manage monitoring systems (logs, metrics, traces)
- Build dashboards for uptime, SLA adherence, performance, and utilization
- Automate repetitive operational tasks using scripts and workflows
- Establish and enforce runbooks and standard operating procedures (SOPs)
- Work closely with Customer Engineering, Product, Engineering, and Support teams
- Provide structured feedback to engineering for product improvements and defect resolution
- Support customer onboarding, deployment readiness, and operational handover
- Participate in customer reviews, escalations, and technical deep dives
Qualifications
- Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or related field
- 10–15+ years of experience in ML operations, systems engineering, or customer support engineering
- Proven experience in customer‑facing technical roles with SLA‑driven support models
- Strong experience with AI/ML inference workloads in production environments
- Deep understanding of end‑to‑end ML inference pipelines
- Hands‑on experience with Linux systems, system bring‑up, drivers, and debugging tools
- Strong understanding of AI accelerator…