Observability & Operations Engineer Job, Phoenix area, Arizona, USA, IT/Tech

Observability & Operations Engineer

About Us:

At Fullbay, our mission is simple - to create safer roads for our families and yours. As leaders in the heavy-duty repair industry, we power shops with technology that helps them run smarter and more efficiently. As an AI-First company, we invite artificial intelligence to eliminate friction, spark innovation, and drive efficiencies in every conversation, for our teams and our customers.

Position Overview:

The Observability & Operations Engineer is a key technical contributor who brings an AI-first mindset to maintaining, monitoring, and operating our AWS cloud environment and internal Developer Platform. In this role, you won't just react to incidents - you'll leverage AI-powered tooling, intelligent alerting, and automation to get ahead of problems before they impact users. You'll work deeply across AWS and its PaaS ecosystem, building repeatable, code-first pipelines that treat infrastructure and observability configuration as first-class software.

From using AI coding assistants to accelerate runbook development, to applying ML-based anomaly detection across logs and metrics, you'll be expected to ask "how can AI help here?" as a first instinct. Working within a dedicated platform team, you'll build the observability foundations that keep our systems fast, reliable, and self-healing.

Primary Duties & Responsibilities:

* Design and implement a comprehensive observability strategy (logging, metrics, tracing, alerting) across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically

* Build and manage monitoring platforms such as Datadog, Grafana, Prometheus, and AWS CloudWatch - actively exploring AI-native features within these tools to reduce alert fatigue and improve signal quality

* Use AI coding assistants (e.g. GitHub Copilot, Claude) to accelerate development of dashboards, runbooks, and automation scripts

* Own the incident management lifecycle - on-call rotations, post-mortems, root cause analysis - and apply AI-assisted log analysis to speed up diagnosis and resolution

* Instrument Java, Kotlin, and Node.js-based cloud-native applications to emit structured logs, distributed traces, and metrics; identify opportunities to use ML-based anomaly detection in place of static thresholds

* Build repeatable, code-first observability pipelines that treat dashboards, alerts, and runbooks as first-class software - versioned, tested, and deployed through Harness

* Leverage AWS PaaS services (Lambda, API Gateway, ECS, RDS, SQS, SNS, and others) to build scalable, automated operational tooling

* Collaborate with development teams to embed observability and AI-assisted quality checks into CI/CD pipelines via Harness

* Own the FinOps function for our AWS environment - tracking cloud spend, building cost dashboards, identifying waste, and using AI-powered cost analysis tools to surface optimization opportunities and drive accountability across engineering teams

* Monitor AWS infrastructure for performance, availability, and cost - partnering with finance and engineering to enforce spend governance

* Develop and maintain Infrastructure as Code using Terraform, using AI pair programming to improve quality and consistency

* Contribute to architectural decisions with a focus on resilience, automation, and reducing toil through intelligent systems

* Adhere to all confidentiality and compliance regulations

* Perform other duties as assigned

Minimum Education & Work Experience:

* 7-10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering

* 5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus

* Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software

* Experience working with polyglot environments including Java, Kotlin, and Node.js

* Demonstrated experience using AI tools (coding assistants, AI-powered observability platforms, or similar) in a professional setting - we're an AI-first company and expect this to be part of how you work, not something you're just exploring

Key Skills and Qualifications:

* Deep experience with enterprise observability platforms - including AWS-native tooling such as CloudWatch, X-Ray, and OpenTelemetry, or comparable platforms such as Datadog, Grafana, or Prometheus

* Proficiency with distributed tracing frameworks and log management platforms (e.g. ELK Stack, Splunk, Fluent Bit); experience mapping these patterns to AWS-native tooling is a strong plus

* Strong understanding of SRE principles including SLOs, SLAs, error budgets, and chaos engineering

* Hands-on FinOps experience - cloud cost allocation, chargeback modeling, rightsizing, and Savings Plans optimization across AWS

* Strong working knowledge of AWS PaaS services including Lambda, API Gateway, ECS, RDS, SQS, SNS, and IAM - and how to leverage them to build scalable operational tooling

* Experience…