Monitoring Engineering Production Services Specialist II, Chandler area, Arizona, USA (IT/Tech)

Job Description

This role provides support to end users and handles incidents and problem management for multiple applications. The primary focus is on triage activities for all business‑impacting incidents.

Responsibilities

Leads production support triage efforts
Manages bridge line troubleshooting
Engages in technical research and escalates issues to leadership as needed
Ensures all impacts are accurately recorded and documented in the system of record
Verifies that documents and wikis are updated and available for use during triage
Supports on‑call responsibilities for incidents
Documents application flows, impacts during outages, customer experience, and contacts for support needs
Provides status updates and technical detail for awareness communications (infrastructure, application and client impact, component points of failure)
Ensures the accuracy of all communications sent and schedules necessary reconvenes
Identifies business impact, interprets monitors, dashboards, and logs, and writes queries to quantify and communicate impacts to leadership
Promotes and enforces production governance during triage/testing, identifies production failure scenarios, vulnerabilities, and improvement opportunities, and escalates issues as needed
Analyzes, manages, and coordinates incident management activities to detect problems that affect service levels
Fulfills research requests, ad hoc reports, and offline incidents at the direction of senior team members or the Technology/Production Services teams

Required Qualifications

Hands‑on experience with Splunk (search, SPL, dashboards, alerts, data onboarding, and tuning)
Hands‑on experience with Dynatrace (APM, services/entities, alerting profiles, management zones, dashboards)
Strong understanding of monitoring and observability concepts: logs, metrics, traces, events, and correlation
Experience supporting production systems, including incident management and operational support
Knowledge of SRE concepts such as reliability engineering, alert hygiene, post‑incident reviews, and automation
Experience working with ITSM processes (incident, problem, change) and tracking SI actions to closure
Basic to intermediate scripting experience (e.g., Python, Shell) for automation and analysis
Strong communication skills and ability to work across distributed teams in the APAC region

Desired Qualifications

Experience with advanced Splunk or Dynatrace features (custom metrics, anomaly detection, DQL/SPL optimization, synthetic monitoring)
Experience integrating monitoring tools with ServiceNow or similar ITSM platforms
Familiarity with capacity monitoring, performance engineering, or business transaction monitoring
Relevant certifications (Splunk, Dynatrace, SRE/DevOps, Cloud) are a plus

Skills

Adaptability
Analytical Thinking
Influence
Production Support Risk Management
Automation
Collaboration
Innovative Thinking
Result Orientation
Solution Design
Business Acumen
DevOps Practices
Project Management
Solution Delivery
Process
Stakeholder Management

Shift & Hours

Shift: 1st shift (United States of America)

Hours per week: 40
