
Senior Site Reliability Engineer, AI Inference

Job in Dublin, Laurens County, Georgia, 31021, USA
Listing for: F5 Networks, Inc
Full Time position
Listed on 2026-05-11
Job specializations:
  • IT/Tech
    AI Engineer
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

AI Inference Engineer

At F5, we strive to bring a better digital world to life. Our teams empower organizations across the globe to create, secure, and run applications that enhance how we experience our evolving digital world. We are passionate about cybersecurity, from protecting consumers from fraud to enabling companies to focus on innovation.

Everything we do centers around people. That means we obsess over how to make the lives of our customers, and their customers, better. And it means we prioritize a diverse F5 community where each individual can thrive.

Role Objective

The AI Inference Engineer plays a critical role in the AI lifecycle by bridging the gap between high-performance model development and optimized deployment environments. This position focuses on optimizing Large Language Models (LLMs) for inference, serving diverse environments—from GPU-rich data centers to resource-constrained edge devices—with a strong emphasis on maximizing throughput, minimizing latency, and maintaining model accuracy. This role is pivotal in advancing F5’s AI capabilities, ensuring enterprise-grade reliability by leveraging hardware acceleration, designing scalable infrastructure, and monitoring system performance.

Key Responsibilities
  • High-Performance AI Serving
    • Build and maintain robust inference engines using tools like vLLM, TGI (Text Generation Inference), and NVIDIA Triton, ensuring high performance at scale.
    • Handle deployment optimizations to deliver low-latency AI serving solutions for multiple business applications.
  • Hardware Acceleration and Optimization
    • Profile and optimize models for specialized hardware backends, including NVIDIA GPUs (CUDA/TensorRT), Apple Silicon (CoreML), and AI accelerators like TPUs and LPUs.
    • Collaborate with hardware teams to maximize utilization and performance across various computational environments.
  • Inference Orchestration and Scalability
    • Design and implement auto-scaling architectures for online (real-time) and batch inference pipelines, leveraging Kubernetes for inference routing and orchestration.
    • Ensure software solutions are optimized for peak performance during traffic spikes, maintaining reliability and scalability.
  • Performance Monitoring and Observability
    • Establish robust observability frameworks to monitor Time to First Token (TTFT), tokens per second, and memory bandwidth utilization against service-level agreements (SLAs).
    • Build and execute performance and load testing suites to identify bottlenecks and ensure consistent reliability at scale.
Technical Requirements
  • Programming Languages
    • Proficiency in Python, C++, Rust, or Go, applied to high-performance AI workflows.
  • Inference Tools
    • Hands‑on experience with tools such as vLLM, TensorRT, llama.cpp, and Ollama for inference development and optimization.
  • Infrastructure Expertise
    • Strong familiarity with Docker, Kubernetes, and cloud platforms such as AWS, GCP, and Azure.
  • Hardware Optimization Expertise
    • Comprehensive understanding of GPU and AI hardware, with techniques for profiling and optimizing performance for accelerators like NVIDIA GPUs and TPUs.
Preferred Experience
  • Prior experience deploying Large Language Models (LLMs) with advanced techniques such as speculative decoding or PagedAttention.
  • Contributions to open-source inference libraries or hardware‑level kernel development (e.g., CUDA, Triton kernels).
  • Background in MLOps or SRE roles focused on high-performance AI endpoints and reliability during demand surges.
  • Proficiency in designing scalable solutions for high-throughput inference environments optimized for traffic bursts.
Success Metrics (KPIs)
  • Latency Reduction – Continuously improve inference latency metrics, ensuring minimal Time to First Token (TTFT) and maximum tokens per second.
  • Cost Efficiency – Achieve lower "Cost per 1K Tokens" through better resource utilization and hardware optimization.
  • Scalability – Maintain system stability and reliability during traffic spikes, ensuring performance consistency across environments.
  • Throughput Maximization – Deploy models tuned to make full use of the available hardware and maximize inference throughput.
Why Join F5?
  • Collaborate with cutting‑edge technologies and hardware…
Position Requirements
10+ years of work experience