Kafka Tier 3 Support Engineer

Job in Canton, Norfolk County, Massachusetts, 02021, USA
Listing for: Tata Consultancy Services Limited
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    Cybersecurity, IT Support, Cloud Computing, Data Security
Job Description & How to Apply Below
Must Have Technical/Functional Skills

Kafka & Streaming

• Strong hands-on experience with Apache Kafka

• Experience supporting at least one of:
  o AWS MSK
  o Confluent Platform / Confluent Cloud
  o Self-managed Kafka (VM or Kubernetes)

• Deep understanding of:
  o Brokers, partitions, replication, ISR, leader election
  o Consumer groups and rebalancing
  o Producer/consumer internals and failure modes

Operations & Performance

• Expertise in diagnosing:
  o Consumer lag and throughput bottlenecks
  o Broker disk, network, and JVM performance
  o Metadata and controller instability

• Experience with monitoring and observability tools (Kafka metrics, CloudWatch, Prometheus, Grafana, etc.)

Security & Governance

• Knowledge of Kafka security concepts:
  o TLS, authentication (IAM/SASL/SCRAM), ACLs/RBAC
  o Principle of least privilege

• Experience supporting regulated or multi-tenant environments

Preferred / Nice to Have Skills

• Experience with Kafka Connect, Schema Registry, or streaming frameworks

• Exposure to KRaft-based Kafka deployments

• Cloud platforms (AWS preferred; Azure/GCP beneficial)

• Automation and IaC experience for Kafka operations

• Experience in SRE or DevOps-aligned environments

Roles & Responsibilities

Key Responsibilities

1. Tier 3 Incident Management & Escalation Support

• Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).

• Lead deep troubleshooting across:
  o Broker instability, controller elections, ISR shrinkage
  o Under-replicated partitions and leader imbalance
  o Producer/consumer failures, lag spikes, and rebalance storms
  o Disk, network, JVM, and request handler saturation

• Provide hands-on remediation for complex issues, including:
  o Partition reassignment and leader rebalance
  o Broker configuration tuning
  o Throttle/quota strategies for noisy producers or consumers

• Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.

• Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization

• Analyze Kafka workloads for performance and scalability risks:
  o Partition skew and hot partitions
  o Inefficient producer batching/compression
  o Consumer lag root cause analysis
  o Thread pool, I/O, and network bottlenecks

• Recommend and validate:
  o Topic design (partition count, replication factor, retention, compaction)
  o Producer and consumer configuration best practices
  o Quotas, quota enforcement, and multi-tenant controls

• Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience

• Diagnose and resolve systemic Kafka stability issues:
  o Repeated broker failures or flapping
  o Metadata/controller instability (ZooKeeper or KRaft)
  o Recovery issues following failovers or maintenance events

• Support resilience initiatives:
  o Multi-AZ cluster health validation
  o Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)
  o Failover testing and validation

• Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership

• Lead medium- to high-risk Kafka changes, including:
  o Broker and cluster configuration changes
  o Partition expansion or large-scale reassignment
  o Topic policy changes impacting durability or performance

• Support and plan:
  o Kafka version upgrades
  o MSK / Confluent upgrade cycles
  o Client compatibility and rollout strategies

• Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement

• Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).

• Identify recurring failure patterns and architectural gaps.

• Recommend platform-level improvements:
  o Automation opportunities
  o Guardrails and standards
  o Monitoring and alerting enhancements

• Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.

6. Mentorship & Collaboration

• Provide technical guidance and mentoring to Tier 2 Kafka support teams.

• Collaborate with: o…