Kafka Tier 3 Support Engineer; Platform & Operations

Job in Canton, Norfolk County, Massachusetts, 02021, USA
Listing for: Diverse Lynx
Full Time position
Listed on 2026-06-11
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, Cybersecurity
Kafka Tier 3 Support Engineer (Platform & Operations)

Location:

Canton, MA

Employment Type

Full-time / Contract

Role Overview

The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments.

This role is the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK / Confluent / cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.

Key Responsibilities

1. Tier 3 Incident Management & Escalation Support

Act as the highest technical escalation point for Kafka production incidents (Sev1 / Sev2).

Lead deep troubleshooting across:

Broker instability, controller elections, ISR shrinkage

Under-replicated partitions and leader imbalance

Producer/consumer failures, lag spikes, and rebalance storms

Disk, network, JVM, and request handler saturation

Provide hands-on remediation for complex issues, including:

Partition reassignment and leader rebalance

Broker configuration tuning

Throttle/quota strategies for noisy producers or consumers

Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.

Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization

Analyze Kafka workloads for performance and scalability risks:

Partition skew and hot partitions

Inefficient producer batching/compression

Consumer lag root cause analysis

Thread pool, I/O, and network bottlenecks

Recommend and validate:

Topic design (partition count, replication factor, retention, compaction)

Producer and consumer configuration best practices

Quotas, quota enforcement, and multi-tenant controls

Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience

Diagnose and resolve systemic Kafka stability issues:

Repeated broker failures or flapping

Metadata/controller instability (ZooKeeper or KRaft)

Recovery issues following failovers or maintenance events

Support resilience initiatives:

Multi-AZ cluster health validation

Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)

Failover testing and validation

Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership

Lead medium- to high-risk Kafka changes, including:

Broker and cluster configuration changes

Partition expansion or large-scale reassignment

Topic policy changes impacting durability or performance

Support and plan:

Kafka version upgrades

MSK / Confluent upgrade cycles

Client compatibility and rollout strategies

Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement

Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).

Identify recurring failure patterns and architectural gaps.

Recommend platform-level improvements:

Automation opportunities

Guardrails and standards

Monitoring and alerting enhancements

Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.

6. Mentorship & Collaboration

Provide technical guidance and mentoring to Tier 2 Kafka support teams.

Collaborate with:

Application teams on Kafka client usage and best practices

Platform and SRE teams on capacity planning and reliability engineering

Security teams on access control, encryption, and compliance requirements

Act as a subject matter expert for Kafka within the organization.

Required Technical Skills

Kafka & Streaming

Strong hands-on experience with Apache Kafka

Experience supporting at least one of:

AWS MSK

Confluent Platform / Confluent Cloud

Self-managed Kafka (VM or Kubernetes)

Deep understanding of:

Brokers, partitions, replication, ISR, leader election

Consumer groups and…