Observability & Monitoring Lead Job, Noida area, Uttar Pradesh, India, IT/Tech

Project Description:

Support clients in the operation, maintenance, and optimization of Oracle Cerner EHR environments. This role is designed for early-career professionals who are eager to grow their technical skills in healthcare IT while working under the mentorship of experienced consultants and technical leaders. You will gain hands-on exposure to Cerner infrastructure, system workflows, and healthcare technology best practices while contributing to meaningful client outcomes.
Responsibilities:
1. Trend Analysis & Problem Identification
- Identify recurring incident patterns, anomalies, and signs of alert fatigue that may indicate deeper systemic issues.
- Collaborate with L2/L3 teams to review telemetry data and recommend improvements to alert thresholds, rules, and policies.
- Provide insights that support proactive issue prevention, noise reduction, and overall monitoring refinement.

2. Platform Management & Optimization
- Develop, update, and maintain dashboards that reflect real-time system health, performance metrics, and service behavior.
- Support the ongoing adoption and optimization of Dynatrace, enhancing dashboarding and visualization capabilities for cloud and on-prem observability.
- Assist in routine platform checks, ensuring monitoring tools remain accurate, stable, and aligned with business and operational requirements.

3. Leadership & Collaboration
- Organize the team's work, including planning, task breakdown, and ensuring clarity of priorities.
- Provide structured, timely updates to leadership on progress, risks, blockers, team capacity, and delivery timelines.
- Work closely with application teams, SRE groups, and infrastructure operations during incident triage, investigations, and routine monitoring reviews.
- Ensure clear, timely, and effective communication with stakeholders during service-impacting events, providing status updates and context as needed.
- Ensure adherence to engineering best practices, drive operational excellence, and maintain accountability for team delivery outcomes.

4. Operational Excellence
- Support platform stability and availability through adherence to lifecycle maintenance, patching schedules, and vulnerability management processes.
- Contribute to the improvement of monitoring workflows, alert routing logic, runbook effectiveness, and incident management practices.

5. Innovation & AI Enablement
- Assist in exploring and adopting AI-driven capabilities that improve observability, automate root-cause identification, and reduce manual effort.
- Contribute to internal knowledge sharing by documenting best practices, playbooks, AI reference materials, and usage guidelines (e.g., Copilot tips).

6. Collaboration & Leadership Support
- Partner with cross-functional teams to align monitoring practices with evolving business needs and operational priorities.
- Drive end-to-end delivery of monitoring initiatives: requirements gathering, planning, execution oversight, and delivery validation.
- Coordinate cross-team dependencies, ensure timelines are met, and proactively remove blockers for the team.
- Provide subject matter support for ITSM processes including incident, problem, and change management discussions.
Mandatory Skills:

New Relic
Mandatory Skills Description:

- 6+ years in Site Reliability Engineering or Observability/Monitoring engineering roles.
- 5+ years hands-on with monitoring/observability tools: New Relic, SolarWinds, WhatsUp Gold (WUG).
- 4+ years of scripting experience (JavaScript, Java, PowerShell, or others).
- 2+ years with Azure (architecture fundamentals, observability in cloud-native and lift-and-shift contexts).
- 4+ years of scripting with Python and Bash or PowerShell for automation.
- Experience troubleshooting complex distributed applications, leading/participating in war rooms, and performing code-level impact analysis (read logs/stack traces, correlate with deploys and infra changes).
- Solid understanding of observability best practices (metrics, logs, traces), ITSM processes, and alert hygiene.
- An "automate any task" mindset.
- Maintain associated documentation as it applies to our audit and certification requirements.
- Ensure platform stability, availability, and compliance through proactive vulnerability management and lifecycle maintenance.
- Drive process improvements for monitoring workflows and incident management.
- Participate in troubleshooting, capacity planning, and performance analysis activities.
- Research new monitoring requirements and, in many cases, write code to meet them.
- Solid expertise in setting up monitoring policies, rules, and templates, and in writing scripts to accomplish monitoring requirements.
- Excellent problem-solving, communication, and cross-team collaboration skills.