Senior Site Reliability Engineer (SRE) - Data Center

in 10115 Berlin, Germany
Company: Hamilton Barnes Associates Limited
Full-time position
Posted on 2025-12-27
Professional specialization:
  • IT/Information Technology
    Systems Engineer, Cloud Computing, Site Reliability Engineer, Network Engineer
Salary range or industry benchmark: EUR 200,000 per year
Job description
Job title: Senior Site Reliability Engineer (SRE) - Data Center

Join a stealth-mode hyperscale data center start-up building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

If you are interested in this incredible opportunity, get in touch today! You don't want to miss out!

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Skills / Must Have
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high‑performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Benefits
  • Equity
Salary
  • €200,000 gross per year
Job requirements
10+ years of professional experience