Principal Site Reliability Engineer Job, Bellevue area, Washington, USA, IT/Tech

What You'll Be Part Of:

iSpot.tv is changing how brands, agencies, and networks measure and assess the impact of TV advertising. We deal with BIG data, operating mainly in AWS with multiple Kubernetes clusters and thousands of servers. We are looking for an experienced SRE leader with the skills and passion to make a significant impact on our ecosystem. You will have a wide array of projects to tackle, with ample opportunities for growth.

You will be a key member of our SRE leadership team, focused on empowering developers to build, test, and deploy applications faster and more efficiently. You will both lead the team and remain hands-on in designing, building, and maintaining the tools, platforms, and processes that improve our engineering teams' productivity and streamline the software development lifecycle. Your work will directly impact developer happiness and the speed at which we can deliver innovative features to our customers.

Responsibilities:

We are seeking a seasoned and strategic Lead/Principal Site Reliability Engineer to drive the reliability, scalability, and performance of our core production systems while significantly enhancing the internal developer experience. This role sits at the intersection of operations and development, requiring deep technical expertise, strong leadership, and a passion for optimizing the entire software development lifecycle (SDLC).

Our team consists of senior engineers who work together with minimal supervision to attain those goals. Candidates must possess deep operational experience with AWS and Kubernetes to support teams utilizing these systems. You will lead the technical direction of the team while remaining a key individual contributor. You will be responsible for creating a culture of engineering excellence, designing self-service platforms, and fostering alignment across all engineering teams to accelerate product delivery and maintain world-class service stability.

The key responsibilities are:

I. System Reliability and Operations (SRE Focus)

* Platform Design and Management:
Architect, build, and maintain scalable, highly available, and reliable cloud infrastructure in AWS leveraging modern container orchestration technologies.

* Data Pipeline Reliability:
Serve as the reliability and cost optimization expert for high-volume, data-intensive workloads. Focus on optimizing and ensuring the stability of distributed data processing engines, specifically Apache Spark and related ecosystems (e.g., EMR, Databricks, Glue).

* Observability and Monitoring:
Establish comprehensive observability practices by defining SLIs/SLOs and implementing advanced monitoring, alerting, and logging solutions to quickly identify and resolve system anomalies.

* Automation:
Drive automation across all operational aspects, including infrastructure provisioning (Terraform), scaling, deployment, and incident response, minimizing toil and manual effort.

* Incident Management:
Lead and participate in the incident response lifecycle, performing thorough post-mortems to derive actionable insights and implement preventative measures to improve system resilience.

* AIOps:
Define and champion the strategic roadmap for AI/ML integration within SRE. Establish organizational best practices for AIOps, automated incident remediation, toil reduction via LLMs, automated root cause analysis (RCA), and the governance of LLM-driven tooling to enhance system observability and resilience.

II. Developer Experience and Productivity (Dev Ex Focus)

* Platform Strategy:
Design, implement, and champion self-service tools, internal developer portals, and services that empower engineering teams to manage their infrastructure and deployments independently and efficiently.

* AI Developer Tools:
Lead the standardization of AI developer assistants by architecting and maintaining global 'steering files' and context-configuration standards, ensuring AI-generated code aligns with our specific patterns, security protocols, and architectural guardrails.

* CI/CD Optimization:
Own and continuously improve the CI/CD pipelines, reducing build times, streamlining deployment workflows, and integrating best practices for testing, security (Shift Left), and code quality. Maintain and improve our container orchestration and deployment tools, leveraging Kubernetes, Helm, and ArgoCD to create seamless developer workflows.

* KPIs:
Develop, implement, and maintain a set of key performance indicators (KPIs) to measure and improve the developer experience across all of Engineering.

* Mentorship and Documentation:
Guide and mentor senior engineers, promoting SRE/Dev Ex principles. Develop clear, comprehensive documentation and tutorials to ensure seamless adoption of new tools and platforms.

* Cost and Efficiency:
Strategically identify and implement opportunities for cloud cost optimization and resource efficiency without compromising reliability or performance.

III. Strategic Leadership and Cross-Team Alignment

* Architecting the Roadmap:
Define,…