Platform - Site Reliability Engineer II (Networking) Job, Huntsville area, Alabama, USA, IT/Tech

Position: Platform - Site Reliability Engineer II (Networking)

Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale — unleashing the potential of businesses and people. The Elastic Search AI Platform, used by more than 50% of the Fortune 500, brings together the precision of search and the intelligence of AI to enable everyone to accelerate the results that matter.

By taking advantage of all structured and unstructured data — securing and protecting private information more effectively — Elastic’s complete, cloud-based solutions for search, security, and observability help organizations deliver on the promise of AI.

What Is the Role

As part of the Platform Engineering department, the Traffic team is crafting, building, and improving the multi‑cloud platform at scale for Elastic Cloud Hosted and Serverless. We grow and mature our distributed network services and solutions for multiple cloud service provider platforms. Built on Kubernetes, Go/Scala, and custom orchestration architectures, our day‑to‑day work involves coding, innovating technical designs, crafting solutions, improving resilience, and prioritizing security, bug fixes, and features.

For example, debugging Azure Networking for Elastic Cloud Serverless is part of our efforts, and we want your experience to contribute to a truly exceptional customer experience.

What You Will Be Doing

Leading technical initiatives for automating network engineering efforts to guarantee the reliability of the global Elastic infrastructure.
Growing our global Platform infrastructure to meet increasing scaling demands by developing and maintaining software, tooling, and automations.
Collaborating in an inclusive environment, focusing on operational excellence, and uplifting others.
Responding to major incidents and, through prioritized problem management, preventing repeat customer impact. Our on‑call rotation uses a follow‑the‑sun model with participation during working hours.

What You Bring

Experience striving for “progress not perfection” in Platform reliability, with a customer‑first approach to solving operational problems from an SRE perspective.
Background in software engineering, able to collaborate with engineers to identify, implement, and deliver solutions. Experience in public cloud and managed Kubernetes services is advantageous.
Passion for developing solutions that involve inclusive communication methods to strengthen partner and team relationships. Experience working in distributed or remote teams is desirable.

Bonus Points

Operated a SaaS product in a public cloud using Infrastructure‑as‑Code tooling such as Crossplane or Terraform.
Built or operated a Kubernetes‑at‑scale infrastructure across multiple cloud providers, with the necessary automation.
Written non‑trivial programs in Go or other programming languages.
Worked with containerized services and container tooling such as Docker.
Led and improved alerting and major incident management processes, metrics, and systems (Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues and communicate impacts.
Administered Linux professionally on distributed systems at scale.
