SRE Kafka Lead
Job in Orlando, Orange County, Florida, 32885, USA
Listed on 2026-06-22
Listing for:
Compunnel, Inc.
Full Time position
Job specializations:
- IT/Tech
- SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
Job Description
The SRE (Kafka Lead) is responsible for leading the architecture, engineering, reliability, automation, security, and operational excellence of enterprise Kafka platforms. This role serves as the technical leader for Kafka and distributed streaming solutions, driving platform design, operational readiness, observability, automation, and reliability engineering initiatives. The ideal candidate combines deep Kafka expertise, Site Reliability Engineering (SRE) practices, DevOps automation, and strong leadership capabilities to deliver highly available, secure, and scalable streaming platforms.
Key Responsibilities
- Serve as the technical lead for Kafka platform architecture, engineering, and operational strategy.
- Define and implement Kafka architectures aligned with industry best practices for distributed systems and event‑driven platforms.
- Provide technical leadership and direction to engineering teams, operations teams, and managed service providers.
- Identify technical risks, architectural gaps, scalability concerns, and improvement opportunities.
- Own platform outcomes and ensure successful delivery of reliability, scalability, and operational objectives.
- Design, deploy, configure, scale, and optimize Kafka clusters in production environments.
- Design and implement topic strategies, partitioning models, replication configurations, and consumer/producer optimization techniques.
- Configure and support Kafka ecosystem components including Kafka Connect, Kafka Streams, Schema Registry, and related technologies.
- Troubleshoot and resolve complex production incidents affecting Kafka platforms and distributed systems.
- Develop and maintain automation solutions using scripting languages such as Python and Bash.
- Implement Infrastructure‑as‑Code solutions using Terraform and similar technologies.
- Design, develop, and support CI/CD pipelines for Kafka platform deployments and operational processes.
- Eliminate manual operational dependencies through automation and self‑service capabilities.
- Design and implement end‑to‑end observability solutions including metrics, logging, tracing, and monitoring.
- Establish Kafka‑specific monitoring for consumer lag, throughput, broker health, cluster performance, and system availability.
- Integrate Kafka monitoring and observability solutions with enterprise monitoring platforms.
- Define and implement alerting strategies aligned to SLAs, SLOs, and operational requirements.
- Establish and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Design high‑availability, resiliency, disaster recovery, and failover strategies for Kafka platforms.
- Lead incident management, troubleshooting, post‑incident reviews, and Root Cause Analysis (RCA) activities.
- Implement and enforce enterprise security controls including encryption, authentication, authorization, access controls, and auditability.
- Ensure compliance with organizational security policies and governance standards.
- Develop operational documentation, runbooks, standard operating procedures, and knowledge transfer materials.
- Facilitate knowledge transfer and operational readiness activities for internal engineering and support teams.
- Collaborate with leadership, engineering teams, and stakeholders to communicate platform status, risks, roadmaps, and strategic recommendations.
- Participate in critical troubleshooting sessions and ensure responsiveness during production incidents.
- Drive continuous improvement initiatives focused on platform reliability, scalability, automation, and operational excellence.
Qualifications
- Strong experience serving as a technical lead for Kafka, distributed systems, or streaming platform environments.
- Deep hands‑on experience with Apache Kafka administration, architecture, and operations.
- Experience with Kafka cluster setup, scaling, tuning, and performance optimization.
- Strong knowledge of topic design, partitioning strategies, replication, and producer/consumer optimization.
- Experience with Kafka security including ACLs, authentication, authorization, and encryption.
- Experience working with Kafka ecosystem technologies such as Kafka Connect, Kafka Streams, and Schema Registry.
- Ability…