Lead Platform Engineer Job, Baltimore area, Maryland, USA, IT/Tech

Lead Platform Engineer

Location:

Hybrid - Must reside within 50 miles of Baltimore, MD / Wilmington, DE / Charlotte, NC / Dallas, TX / New York, NY / Evansville, IN (local candidates in these locations only)
Duration: 6+ month contract

Employment Type:

Contract
Work authorization: GC (Green Card) or USC (U.S. Citizen)

Required skills:

SRE, OpenTelemetry, Elastic Observability, Grafana, OpsRamp, BigPanda, AWS/Azure, Kubernetes, Docker, CloudWatch, Monitoring & Alerting, SLI/SLO, Automation, Python/Bash/PowerShell, Distributed Systems, Incident Management, CI/CD, Terraform/Ansible.

Job Overview:
We are seeking an experienced Lead Platform Engineer to join a high-performing Monitoring & Observability Engineering team within a fast-paced enterprise environment. The ideal candidate will have strong expertise in Site Reliability Engineering (SRE), OpenTelemetry, Elastic Observability, cloud monitoring, and enterprise platform reliability.

This role focuses on designing and enhancing observability frameworks, telemetry pipelines, monitoring standards, dashboards, alerting systems, and automation capabilities to improve system reliability, reduce mean time to recovery (MTTR), and support mission-critical production platforms.

Key Responsibilities:

• Design, deploy, and maintain OpenTelemetry-based telemetry pipelines and observability frameworks

• Build and support enterprise monitoring solutions using Grafana, Elastic Stack, OpsRamp, BigPanda, AWS CloudWatch, and Azure Monitor

• Implement SRE best practices including SLIs, SLOs, error budgets, and reliability dashboards

• Collaborate with development, infrastructure, and platform teams to define observability and monitoring standards

• Develop actionable alerts, dashboards, tracing, and telemetry for distributed systems and business-critical applications

• Automate monitoring, incident response, self-healing workflows, and operational tasks

• Improve system reliability through performance tuning, capacity planning, and proactive issue detection

• Support incident management, root cause analysis, escalation, and recovery processes

• Maintain technical documentation, monitoring standards, runbooks, and diagnostic guides

• Mentor junior engineers and promote reliability engineering and operational excellence across teams

Required Skills & Experience:

• 5+ years of experience in Platform Engineering, SRE, Reliability Engineering, or Monitoring Engineering roles

• Strong hands-on expertise with OpenTelemetry, Elastic Observability (APM, Logs, Metrics, Traces), Grafana, OpsRamp, BigPanda, CloudWatch, and Azure Monitor

• Experience building scalable monitoring, dashboarding, telemetry, and alerting solutions across distributed environments

• Strong scripting/programming experience with Bash, PowerShell, Python, JavaScript, or C-family languages

• Expertise with AWS and/or Azure cloud platforms

• Strong experience with Kubernetes, Docker, and containerized platforms

• Understanding of distributed systems, networking, DevSecOps, security, and performance engineering

• Excellent troubleshooting, analytical, and communication skills

• Ability to work cross-functionally in large enterprise environments

Preferred Qualifications:

• Experience with CI/CD tools such as Jenkins, GitHub, GitLab CI, or CircleCI

• Knowledge of Infrastructure as Code tools like Terraform or Ansible

• Experience with REST APIs, JSON, and ServiceNow

• Familiarity with microservices and event-driven architectures

• Experience with time-series data visualization and analytics

Education:

• Bachelor's degree in Computer Science, Information Technology, or a related field preferred