Site Reliability Engineer
Listed on 2025-12-27
IT/Tech
Systems Engineer, SRE/Site Reliability
The role is remote with in-office presence in Brno 1–2 times per month.
The Company
Capital Markets Gateway LLC (CMG) is a capital markets‑focused fintech transforming global equity capital markets (ECM) through data, technology, and connectivity. As the preferred source for ECM analytics and the first network connecting the buy‑side and sell‑side for ECM workflows, we are committed to reshaping how capital markets operate. Founded in 2017 by a team of ECM practitioners, CMG has completed three successful fundraising rounds and is backed by a group of the world’s most prestigious financial institutions.
The CMG platform is currently relied upon by nearly 150 buy‑side firms representing $40 trillion in AUM and 22 global investment banks. For more information, use the "Apply for this Job" box below.
CMG is looking for a Site Reliability Engineer (SRE) with a strong focus on monitoring, observability, and alerting to ensure the reliability, performance, and scalability of our infrastructure and applications. You will be responsible for designing, implementing, and maintaining monitoring solutions to provide visibility into system health and performance, proactively detect anomalies, and reduce incident response time.
Our Engineering Team
The CMG engineering team consists of domain experts who work collaboratively within a culture of cross‑domain knowledge sharing. We value engineers who are passionate about modern technologies and best practices.
Our engineers are encouraged to challenge the status quo and are constantly seeking improvement and efficiency in our code‑base and platform. CMG engineers are empowered to explore solutions using bleeding‑edge technologies such as AI and bring recommendations to the table. We are in a period of making impactful engineering decisions.
As part of our process, we take the time for research and prototyping, which is critical to making the right decisions. Given the experience of our team, we have naturally adopted best practices from local development through code review and into production rollouts. Beyond the standard pull requests, test automation, code coverage tracking, containerization, and one‑click deployments, we constantly review these foundational components to develop new best practices.
Responsibilities
Monitoring & Observability
- Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
- Define and implement SLOs, SLIs, and error budgets to measure system reliability.
- Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
- Design actionable alerting strategies to minimize noise and improve MTTR.
- Integrate alerting systems with Jira.
- Establish and refine runbooks for on‑call teams to handle alerts efficiently.
- Empower teams to own their observability coverage and incident response practices.
- Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost‑effectiveness.
- Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
- Identify opportunities for automation and develop tools to streamline operational processes, such as fail‑over, configuration management, and monitoring.
- Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
- Collaborate closely with cross‑functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
- Communicate effectively to stakeholders about system changes, incidents, and improvements.
- Promote SRE principles and practices across the company.
Requirements
- Proven experience as a Site Reliability Engineer or similar role.
- Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
- Experience with cloud platforms (Azure preferred) and…