Lead Platform Engineer Job, Evansville area, Indiana, USA, IT/Tech

Lead Platform Engineer (Monitoring & Observability)
Evansville, IN; Baltimore, MD; Wilmington, DE; Charlotte, NC; or Irving, TX

Hybrid role, onsite 3 days per week as directed.
Candidates must live within 50 miles of the corporate office located in Evansville, IN; Baltimore, MD; Wilmington, DE; Charlotte, NC; or Irving, TX.
Potential for contract extension

Note: MUST be legally authorized to work in the United States.
This role is NOT open to third-party providers; W2 only.

SUMMARY:

We're seeking a Lead Platform Engineer (Monitoring & Observability) to join a high-performing Monitoring Engineering team within a fast-paced financial technology organization. In this role, you will apply SRE principles to design, build, and evolve monitoring and observability capabilities that ensure the reliability, performance, and operability of core applications and infrastructure.
You will partner closely with application, platform, and development teams to implement data-driven alerting, SLO/SLA-based monitoring, telemetry pipelines, dashboards, correlations, and automated remediation. Your work will directly improve system reliability, reduce MTTR, and enhance enterprise-wide operational insight.
This role requires strong analytical thinking, systems engineering discipline, and a proactive approach to identifying risks, preventing incidents, and driving continuous improvement across the production ecosystem.

KEY RESPONSIBILITIES:
Design, Build, and Maintain Monitoring & Observability Solutions

Architect, deploy, and operate OpenTelemetry-based telemetry pipelines, including instrumentation standards, collector configurations, sampling strategies, and routing to Elastic and other backends
Develop and maintain instrumentation, telemetry, and alerting for the Enterprise Monitoring Center using industry-leading tools, such as:
Grafana, OpsRamp, Elastic Stack, BigPanda, AWS CloudWatch, Azure Monitor
Drive observability standards and best practices across multiple engineering teams through influence, documentation, and partnership rather than direct authority
Apply SRE best practices to ensure measurable SLIs/SLOs, reliability dashboards, and health indicators for critical systems
Integrate and manage OpenTelemetry for distributed tracing and telemetry data collection, enabling end-to-end visibility of business-critical transactions

Collaboration & Project Participation

Collaborate with application development teams to define and document observability requirements for each project or release, ensuring accurate and actionable monitoring and tracing are in place for every step of business-critical workflows
Embed reliability considerations early in the SDLC, including SLO definitions, instrumentation needs, and failure mode awareness
Partner with product and engineering teams to use SLOs and error budgets to guide release decisions, prioritization, and toil reduction

Alerting & Escalation Process

Define and maintain standardized alert payloads per engineering guidelines, ensuring alerts are actionable
Partner with Level 2 and Level 3 support teams to reflect process changes in monitoring dashboards
Maintain and optimize alert thresholds, ensuring seamless escalations via BigPanda as the central alert hub

Dashboard Creation & Maintenance

Create and maintain intuitive, actionable dashboards for the Enterprise Monitoring Center and other finance teams
Ensure dashboards are effectively monitored by Level 1 teams, presenting clear, actionable data that reduces MTTR

Documentation, Governance & Reliability Standards

Develop and maintain technical documentation, runbooks, diagnostic guides, and observability standards across the enterprise
Evaluate and refine release, deployment, and monitoring processes to support consistent, reliable delivery pipelines
Mentor junior engineers and promote a culture focused on reliability, automation, and operational excellence

Reliability Engineering, Automation & Continuous Improvement

Build automation frameworks for monitoring, alerting, self-healing workflows, and incident response to reduce toil and improve MTTR
Drive system optimization through capacity analysis, performance tuning, and proactive detection of reliability risks
Contribute to the automation of routine operational tasks to improve system reliability and engineer quality of life
Advocate for and implement observability best practices across engineering teams
Define, implement, and operationalize SLIs, SLOs, and error budgets for critical services
Participate in and improve incident response processes, including detection, triage, escalation, and recovery

QUALIFICATIONS:
Education:

Experience:

At least 5 years of experience in software, systems, or reliability engineering roles, with multiple years of hands-on experience owning production observability, monitoring, and SLOs in distributed systems

Required Skills:

Deep experience building scalable, reliable monitoring and observability solutions, including instrumentation,…