Kafka Tier 3 Support Engineer
Job in Canton, Norfolk County, Massachusetts, 02021, USA
Listed on 2026-06-02
Listing for: Tata Consultancy Service Limited
Full Time position
Job specializations:
- IT/Tech: Cybersecurity, IT Support, Cloud Computing, Data Security
Job Description & How to Apply Below
• Strong hands-on experience with Apache Kafka
• Experience supporting at least one of:
o AWS MSK
o Confluent Platform / Confluent Cloud
o Self-managed Kafka (VM or Kubernetes)
• Deep understanding of:
o Brokers, partitions, replication, ISR, leader election
o Consumer groups and rebalancing
o Producer/consumer internals and failure modes
Operations & Performance
• Expertise in diagnosing:
o Consumer lag and throughput bottlenecks
o Broker disk, network, and JVM performance
o Metadata and controller instability
• Experience with monitoring and observability tools (Kafka metrics, CloudWatch, Prometheus, Grafana, etc.)
Security & Governance
• Knowledge of Kafka security concepts:
o TLS, authentication (IAM/SASL/SCRAM), ACLs/RBAC
o Principle of least privilege
• Experience supporting regulated or multi-tenant environments
Preferred / Nice to Have Skills
• Experience with Kafka Connect, Schema Registry, or streaming frameworks
• Exposure to KRaft-based Kafka deployments
• Cloud platforms (AWS preferred; Azure/GCP beneficial)
• Automation and IaC experience for Kafka operations
• Experience in SRE or DevOps-aligned environments
Roles & Responsibilities
Key Responsibilities
1. Tier 3 Incident Management & Escalation Support
• Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).
• Lead deep troubleshooting across:
o Broker instability, controller elections, ISR shrinkage
o Under-replicated partitions and leader imbalance
o Producer/consumer failures, lag spikes, and rebalance storms
o Disk, network, JVM, and request handler saturation
• Provide hands-on remediation for complex issues, including:
o Partition reassignment and leader rebalance
o Broker configuration tuning
o Throttle/quota strategies for noisy producers or consumers
• Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.
• Guide Tier 2 teams during major incidents and validate restoration actions.
2. Kafka Performance Engineering & Optimization
• Analyze Kafka workloads for performance and scalability risks:
o Partition skew and hot partitions
o Inefficient producer batching/compression
o Consumer lag root cause analysis
o Thread pool, I/O, and network bottlenecks
• Recommend and validate:
o Topic design (partition count, replication factor, retention, compaction)
o Producer and consumer configuration best practices
o Quotas, quota enforcement, and multi-tenant controls
• Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.
3. Platform Stability, Reliability & Resilience
• Diagnose and resolve systemic Kafka stability issues:
o Repeated broker failures or flapping
o Metadata/controller instability (ZooKeeper or KRaft)
o Recovery issues following failovers or maintenance events
• Support resilience initiatives:
o Multi-AZ cluster health validation
o Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
o Failover testing and validation
• Define and improve Kafka SLOs for availability, durability, and latency.
4. Change, Upgrade & Configuration Leadership
• Lead medium- to high-risk Kafka changes, including:
o Broker and cluster configuration changes
o Partition expansion or large-scale reassignment
o Topic policy changes impacting durability or performance
• Support and plan:
o Kafka version upgrades
o MSK / Confluent upgrade cycles
o Client compatibility and rollout strategies
• Participate in CAB reviews, assess risk, and design rollback and validation plans.
5. Root Cause Analysis & Continuous Improvement
• Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).
• Identify recurring failure patterns and architectural gaps.
• Recommend platform-level improvements:
o Automation opportunities
o Guardrails and standards
o Monitoring and alerting enhancements
• Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.
6. Mentorship & Collaboration
• Provide technical guidance and mentoring to Tier 2 Kafka support teams.
• Collaborate with: o…