Brightstar Lottery - Cloud/Site Reliability Engineer
Listed on 2025-12-07
IT/Tech
Cloud Computing, Systems Engineer
Overview
Brightstar is an innovative, forward-thinking global leader in lottery that builds on our renowned expertise in delivering secure technology and producing reliable, comprehensive solutions for our customers. As a premier pure-play global lottery company, our best-in-class lottery operations, retail and digital solutions, and award-winning lottery games enable our customers to achieve their goals, fulfill player needs, and distribute meaningful benefits to communities.
Brightstar has a well-established local presence and is a trusted partner to governments and regulators around the world, creating value by adhering to the highest standards of service, integrity, and responsibility. Brightstar has approximately 6,000 employees. For more information, please visit
We are seeking a Cloud/Site Reliability Engineer to join our Cloud Infrastructure Engineering, Operations & Automation team. This role is designed for engineers who are passionate about building resilient systems, preventing incidents before they occur, and driving operational excellence through intelligent monitoring, AI-driven automation, and continuous improvement.
You’ll play a pivotal role in evolving our cloud-hosted environments to be more self-aware, self-healing, and scalable, and in ensuring high availability and performance of our applications and services. During production incidents, your investigation of issues will help L3 product engineers engage quickly and effectively.
Responsibilities
As a Cloud/Site Reliability Engineer, you will focus on Level 2 (L2) operational ownership with a strong emphasis on proactive monitoring, root cause analysis, and automation-driven remediation:
Monitoring & Observability
- Design and refine monitoring strategies using tools like Dynatrace, Prometheus, and ELK.
- Develop alerting standards that reduce noise and increase signal quality.
- Continuously improve observability to detect anomalies before they impact users.
- Assess key application workload metrics for performance and reliability, alongside infrastructure and middleware monitoring.
- Identify issues in public and hybrid cloud services and resources.
- Correlate alerts with telemetry and logs to identify systemic issues and improvement opportunities.
- Work with L3 product engineers and cloud vendors to resolve cases.
Automation & Self-Healing
- Design, build, and maintain robust automation pipelines using tools such as Terraform, Ansible, Jenkins, Helm, and Bash to streamline cloud operations.
- Develop and implement self-healing capabilities that proactively detect and remediate issues, minimizing manual intervention and downtime.
- Analyze operational workflows to identify repetitive tasks and transform them into scalable, automated solutions.
- Collaborate with the Architecture team to enhance and enforce cloud baseline standards for consistency and reliability.
- Automate incident response and recovery processes, leveraging tools like PagerDuty to accelerate resolution and improve system resilience.
Cloud Infrastructure Operations
- Advanced experience with both Azure and AWS cloud service providers.
- Manage cloud infrastructure and services.
- Monitor and optimize cloud resource usage.
- Open and manage Microsoft support tickets in collaboration with L3.
- Participate in a 24x7 on-call rotation, including after-hours support for critical incident response.
Qualifications
- Hands-on experience in cloud operations or site reliability engineering.
- Practical experience managing public cloud infrastructure and services (Azure or AWS knowledge preferred).
- Proficiency in scripting and automation (Terraform, PowerShell, Python, Bash).
- Experience with Infrastructure as Code (IaC) and GitOps principles.
- Hands-on experience with Kubernetes (K8s) and container orchestration.
- Expertise in monitoring tools (Dynatrace, Datadog, Prometheus, ELK).
- Strong analytical, troubleshooting, and communication skills.
Preferred Qualifications
- Experience applying agentic AI techniques to drive intelligent automation, optimize cloud services, accelerate troubleshooting and root-cause analysis, and enhance system resilience and recoverability.
- Familiarity with AI/MLOps or AI-assisted observability tools.
- Thorough understanding of Java application workloads and Java performance topics.
- Deep knowledge of at least one programming language (Java, Python, or Go).
- Strong Linux and networking skills.
- Understanding of software architecture patterns and application development principles.
- Public cloud certifications are a plus.
- Experience in a 24/7 operations environment.
You’ll be part of a forward-thinking Cloud Infrastructure Engineering, Operations & Automation team that values prevention over reaction, automation over repetition, and collaboration over silos. Your work will directly contribute to building a more resilient, scalable, and intelligent cloud ecosystem.
At Brightstar, we consider a wide range of factors in determining…