Kafka Tier 3 Support Engineer; Platform & Operations
Job in
Canton, Norfolk County, Massachusetts, 02021, USA
Listed on 2026-06-11
Listing for:
Diverse Lynx
Full-time position
Job specializations:
IT/Tech
Systems Engineer, Cloud Computing, IT Support, Cybersecurity
Job Description & How to Apply Below
Location:
Canton, MA
Employment Type
Full-time / Contract
Role Overview
The Kafka Tier 3 Support Engineer is a senior technical role responsible for expert-level support, advanced troubleshooting, performance engineering, and platform stabilization of enterprise Apache Kafka environments.
This role functions as the final technical escalation point for Kafka-related production incidents and is accountable for root cause analysis (RCA), complex remediation, and long-term prevention. The engineer works closely with Tier 2 operations, Platform Engineering, SRE teams, application teams, and vendor support (AWS MSK / Confluent / Cloud providers) to ensure Kafka remains a highly reliable, scalable, and secure streaming backbone.
Key Responsibilities
1. Tier 3 Incident Management & Escalation Support
Act as the highest technical escalation point for Kafka production incidents (Sev1 / Sev2).
Lead deep troubleshooting across:
Broker instability, controller elections, ISR shrinkage
Under-replicated partitions and leader imbalance
Producer/consumer failures, lag spikes, and rebalance storms
Disk, network, JVM, and request handler saturation
Provide hands-on remediation for complex issues, including:
Partition reassignment and leader rebalance
Broker configuration tuning
Throttle/quota strategies for noisy producers or consumers
Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
Guide Tier 2 teams during major incidents and validate restoration actions.
2. Kafka Performance Engineering & Optimization
Analyze Kafka workloads for performance and scalability risks:
Partition skew and hot partitions
Inefficient producer batching/compression
Consumer lag root cause analysis
Thread pool, I/O, and network bottlenecks
Recommend and validate:
Topic design (partition count, replication factor, retention, compaction)
Producer and consumer configuration best practices
Quotas, quota enforcement, and multi-tenant controls
Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
3. Platform Stability, Reliability & Resilience
Diagnose and resolve systemic Kafka stability issues:
Repeated broker failures or flapping
Metadata/controller instability (ZooKeeper or KRaft)
Recovery issues following failovers or maintenance events
Support resilience initiatives:
Multi-AZ cluster health validation
Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
Failover testing and validation
Define and improve Kafka SLOs for availability, durability, and latency.
4. Change, Upgrade & Configuration Leadership
Lead medium- to high-risk Kafka changes, including:
Broker and cluster configuration changes
Partition expansion or large-scale reassignment
Topic policy changes impacting durability or performance
Support and plan:
Kafka version upgrades
MSK / Confluent upgrade cycles
Client compatibility and rollout strategies
Participate in CAB reviews, assess risk, and design rollback and validation plans.
5. Root Cause Analysis & Continuous Improvement
Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
Identify recurring failure patterns and architectural gaps.
Recommend platform-level improvements:
Automation opportunities
Guardrails and standards
Monitoring and alerting enhancements
Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
6. Mentorship & Collaboration
Provide technical guidance and mentoring to Tier 2 Kafka support teams.
Collaborate with:
Application teams on Kafka client usage and best practices
Platform and SRE teams on capacity planning and reliability engineering
Security teams on access control, encryption, and compliance requirements
Act as a subject matter expert for Kafka within the organization.
Required Technical Skills
Kafka & Streaming
Strong hands-on experience with Apache Kafka
Experience supporting at least one of:
AWS MSK
Confluent Platform / Confluent Cloud
Self-managed Kafka (VM or Kubernetes)
Deep understanding of:
Brokers, partitions, replication, ISR, leader election
Consumer groups and…