Site Reliability Engineer
New York, USA
Listed on 2026-02-17
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, Cybersecurity
Ready to embark on the quest of joining Hack The Box?
At the end of this thrilling journey, you’ll become a proud member of Hack The Box, with the ultimate mission to help cybersecurity professionals and organizations enhance their cyber-attack readiness. Get ready for an exciting adventure into the world of cybersecurity! 🚀🔒💻
The Core Mission of the Site Reliability Engineer (SRE)
As a Site Reliability Engineer at Hack The Box, your paramount mission is to assist with the seamless migration to AWS, strategically positioning our infrastructure to scale effectively with the company. Over the next 6 months, you will participate in enhancing our capabilities for expansion, setting the stage for the addition of new systems such as Kubernetes clusters, services, and databases. Your focus will then shift towards establishing key performance indicators, service level objectives, and incident response metrics to drive a culture of reliability and continuous improvement.
🏢Location & Work Mode
Fully Remote / Hybrid (2 days in the office, 3 days remote, plus one month of work from anywhere).
When hiring in Greece, we’re open to candidates from all locations. Those based within 55 km of our Athens office will follow a hybrid work model. For candidates located beyond that radius, a fully remote arrangement is available.
The Fellowship You’ll Be Joining
You’ll join a team of 6 SREs and collaborate closely with engineers, data scientists, and security experts. You will report directly to the SRE Manager and have open communication with infrastructure department management and other high-caliber technical people across the organization.
Technology Tools & Weapons You’ll Be Using
- Infrastructure as Code (Terraform): Automate the provisioning of AWS resources.
- Containerization and Orchestration (Kubernetes, Flux CD): Ensure seamless deployment and scaling of applications.
- Monitoring and Logging (Prometheus, Mimir, Grafana, Loki): Expand monitoring capabilities for new systems.
- Automation and Scripting (Go, Python, etc.): Scripting for efficient and automated processes.
- Cloud Platforms (AWS): Execute the migration plan with a focus on AWS.
- Heavily contribute to the AWS Migration for Scalability: Spearhead the migration from the current cloud provider towards AWS, strategically positioning our infrastructure for scalable growth across regions.
- Expand Monitoring Stack: Integrate new systems into the Monitoring Stack, enhancing visibility and alerting capabilities for a globally distributed architecture.
- Architectural Design for Reliability: Contribute to the design and implementation of reliable AWS infrastructure, focusing on fault tolerance and high availability.
- Establish Metrics Framework: Implement and manage Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and improve system reliability.
- Incident Response Enhancement: Develop and enhance incident response processes, leveraging metrics to continually improve response times and effectiveness.
- Mentorship: Mentor and guide junior SREs in adapting to the AWS environment and implementing reliability best practices.
- Collaborative Planning: Work closely with cross-functional teams to plan and implement new systems effectively, ensuring alignment with reliability goals.
- Team Expansion: Play a key role in the team's expansion, contributing to the mentoring of junior members.
- Best Practices Advocacy: Champion best practices in AWS architecture and SRE methodologies, fostering a culture of reliability and continuous improvement.
- Hands-on Experience: Minimum 2 years of hands-on experience in site reliability engineering or a related field.
- Automation Skills: Proficient in scripting and automation using languages such as Go, Python or Bash.
- Cloud Expertise: In-depth knowledge of cloud platforms, particularly AWS.
- Containerization: Experience with containerization technologies (Docker) and orchestration (Kubernetes).
- Monitoring Mastery: Strong…