
Sr. Site Reliability Engineer

Job in Everett, Snohomish County, Washington, 98213, USA
Listing for: Tiger Analytics
Full Time position
Listed on 2026-05-31
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

Role Overview

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps, bridging the gap between model development and production-grade reliability.

Key Responsibilities

1. Reliability & Performance Engineering
  • SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
  • Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
  • Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.
2. MLOps & AI Infrastructure
  • Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
  • GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
  • Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
3. Automation & Orchestration (Eliminating "Toil")
  • Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
  • CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
  • Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, self-healing mechanisms, and resource cleanup.
4. Monitoring, Alerting & Incident Response
  • Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud Operations Suite (formerly Stackdriver).
  • Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
  • Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Skills & Qualifications

Orchestration: Expert-level knowledge of Kubernetes (K8s) and Docker.

MLOps Stack: Familiarity with tools such as Kubeflow, Vertex AI, MLflow, or DVC.

Scripting: Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.

Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or vector databases like Pinecone/Milvus).

Networking: Solid understanding of VPCs, Load Balancers, DNS, and secure service mesh (Istio/Anthos).

Benefits

Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.

Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis as protected by federal, state, or local law.
