Kafka Tier 3 Support Engineer

Job in Canton, Norfolk County, Massachusetts, 02021, USA
Listing for: Tata Consultancy Services Limited
Full Time position
Listed on 2026-06-05
Job specializations:
  • IT/Tech
    Cybersecurity, IT Support, Cloud Computing, Data Security
Job Description & How to Apply Below
Must Have Technical/Functional Skills

Kafka & Streaming

• Strong hands-on experience with Apache Kafka

• Experience supporting at least one of:

o AWS MSK

o Confluent Platform / Confluent Cloud

o Self-managed Kafka (VM or Kubernetes)

• Deep understanding of:

o Brokers, partitions, replication, ISR, leader election

o Consumer groups and rebalancing

o Producer/consumer internals and failure modes

Operations & Performance

• Expertise in diagnosing:

o Consumer lag and throughput bottlenecks

o Broker disk, network, and JVM performance

o Metadata and controller instability

• Experience with monitoring and observability tools (Kafka metrics, CloudWatch, Prometheus, Grafana, etc.)

Security & Governance

• Knowledge of Kafka security concepts:

o TLS, authentication (IAM/SASL/SCRAM), ACLs/RBAC

o Principle of least privilege

• Experience supporting regulated or multi-tenant environments

Preferred / Nice to Have Skills

• Experience with Kafka Connect, Schema Registry, or streaming frameworks

• Exposure to KRaft-based Kafka deployments

• Cloud platforms (AWS preferred; Azure/GCP beneficial)

• Automation and IaC experience for Kafka operations

• Experience in SRE or DevOps-aligned environments

Roles & Responsibilities

Key Responsibilities

1. Tier 3 Incident Management & Escalation Support

• Act as the highest technical escalation point for Kafka production incidents (Sev 1 / Sev 2).

• Lead deep troubleshooting across:

o Broker instability, controller elections, ISR shrinkage

o Under-replicated partitions and leader imbalance

o Producer/consumer failures, lag spikes, and rebalance storms

o Disk, network, JVM, and request handler saturation

• Provide hands-on remediation for complex issues, including:

o Partition reassignment and leader rebalance

o Broker configuration tuning

o Throttle/quota strategies for noisy producers or consumers

• Coordinate with vendor support during service incidents, providing logs, metrics, and forensic details.

• Guide Tier 2 teams during major incidents and validate restoration actions.

2. Kafka Performance Engineering & Optimization

• Analyze Kafka workloads for performance and scalability risks:

o Partition skew and hot partitions

o Inefficient producer batching/compression

o Consumer lag root cause analysis

o Thread pool, I/O, and network bottlenecks

• Recommend and validate:

o Topic design (partition count, replication factor, retention, compaction)

o Producer and consumer configuration best practices

o Quotas, quota enforcement, and multi-tenant controls

• Support onboarding of high-throughput or latency-sensitive workloads, ensuring Kafka is correctly sized and tuned.

3. Platform Stability, Reliability & Resilience

• Diagnose and resolve systemic Kafka stability issues:

o Repeated broker failures or flapping

o Metadata/controller instability (ZooKeeper or KRaft)

o Recovery issues following failovers or maintenance events

• Support resilience initiatives:

o Multi-AZ cluster health validation

o Replication and DR strategies (MirrorMaker 2, Replicator, or app-level DR patterns)

o Failover testing and validation

• Define and improve Kafka SLOs for availability, durability, and latency.

4. Change, Upgrade & Configuration Leadership

• Lead medium- to high-risk Kafka changes, including:

o Broker and cluster configuration changes

o Partition expansion or large-scale reassignment

o Topic policy changes impacting durability or performance

• Support and plan:

o Kafka version upgrades

o MSK / Confluent upgrade cycles

o Client compatibility and rollout strategies

• Participate in CAB reviews, assess risk, and design rollback and validation plans.

5. Root Cause Analysis & Continuous Improvement

• Own RCA documentation for major incidents with clear corrective and preventive actions (CAPA).

• Identify recurring failure patterns and architectural gaps.

• Recommend platform-level improvements:

o Automation opportunities

o Guardrails and standards

o Monitoring and alerting enhancements

• Contribute to continuous improvement of runbooks, knowledge base articles, and operational playbooks.

6. Mentorship & Collaboration

• Provide technical guidance and mentoring to Tier 2 Kafka support teams.

• Collaborate with:

o Application teams on Kafka client usage and best…