Site Reliability Engineer - Data Services
Listed on 2026-04-23
IT/Tech
Cloud Computing, SRE/Site Reliability, Systems Engineer
SRE is part of a global organization that uses the latest technology to communicate with colleagues worldwide. We organize ourselves into distributed teams, each anchored to an iManage office. Tuesdays and Thursdays are dedicated to in‑office collaboration, rapid innovation, and developing a sense of belonging. Other days, including Fridays, are reserved for (remote‑friendly) focus time to get things done.
Have the best of both work styles in a workplace that is intentional about belonging, collaboration, and accomplishment.
You are an engineer, a builder, and a systems thinker. You ensure data durability, optimize query performance, and manage stateful storage upgrades. You combine technical depth with empathy, working closely with customers who hold the highest expectations for the stewardship of the world’s most sensitive data. You elevate the people around you—acting as a subject‑matter expert, a mentor, and an agent of change.
You focus on contributing factors rather than single root causes, value code over documentation and documentation over process, and continuously seek ways to reduce toil. You participate in architectural and design discussions and help shape a scalable, resilient platform that supports both our customers and our organization. You collaborate across teams to drive unified, standards‑based decisions that strengthen reliability. You also take part in on‑call rotations and provide expertise in observability, change management, and system scalability.
As iManage experiences rapid growth in its flagship cloud product, we’re looking for engineers who bring a beginner’s mindset, embrace complexity, and care deeply about resilience and sustainability in a cloud‑native world. This role includes a strong focus on the reliability and evolution of our core data services, including MariaDB, MaxScale, and Elasticsearch. If you write code, think in systems, automate relentlessly, and are passionate about reliability and scale, we want to talk to you.
- Eliminating toil through automation and software development.
- Partnering productively and cross‑functionally with application teams and other internal stakeholders.
- Creating a modern, cloud‑native platform that is resilient, cost‑effective, and secure by default.
- Scaling and tuning high‑availability data clusters (MariaDB, MaxScale, Elasticsearch) in a Kubernetes environment.
- Maintaining the freshness and utility of our platform services.
- Improving the security posture of our products.
- Writing / designing automation, orchestration, observability, and disaster readiness into our products.
- Coordinating and participating in production support and on‑call rotations.
- Leading incident management efforts and post‑incident retrospectives.
- Comfort writing design documents / postmortems and refactoring application code when needed.
- Experience operating or supporting distributed data systems (e.g., relational databases, search clusters, or sharded storage systems).
- Experience developing automation to reduce the operational burden of a product, or developing software‑as‑a‑service for internal customers.
- Ability to advocate for SRE concepts such as those popularized by Google (e.g., knowing the difference between an SLO and an SLA and being able to introduce both effectively to an organization).
- Experience working in a public cloud and/or hosted datacenter environment (Azure and AKS strongly preferred).
- A passion for working collaboratively with other teams.
- Hands-on experience with MariaDB, MaxScale, or Elasticsearch in production environments.
- Familiarity with data store observability, query performance tuning, or capacity planning.
- Hands-on experience with Linux server stacks (Ubuntu/Debian distributions preferred).
- Knowledge of cloud provisioning platforms (HashiCorp Terraform preferred).
- Exposure to at least one configuration management platform (Chef preferred).
- Experience with containerization/clustering technologies (Docker preferred).
- Comfort with observability and alerting tools (Prometheus/Grafana or ELK/EFK preferred).
- Practical…