Site Reliability Engineer SINDC
Listed on 2026-02-07
IT/Tech
Systems Engineer, Cloud Computing
Overview
We are currently sourcing a Site Reliability Engineer to work in the client's Enterprise Infrastructure Group in Westlake, TX or Merrimack, NH.
Shift: When on call, 10 am EST – 8 pm EST; when not on call, M-F 9:00 am – 5:00 pm. On call is twice a week, and one of those days may fall on a weekend.
Responsibilities
- Provide enterprise Cloud and Platform Engineering support for production environments and participate in the on-call rotation to provide solutions.
- Lead all aspects of production support—readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization.
- Define & implement practices in resiliency engineering, automation, observability & chaos testing to improve system reliability.
- Solve stack-wide engineering issues related to hardware, software, network, applications, and cloud service providers.
- Balance delivery with ad hoc workloads and re-evaluate priorities as needed.
- Triage incidents, perform root cause analysis, and solve problems decisively under pressure.
- Maintain scalability and resiliency of a complex environment; implement advanced observability practices and techniques at scale.
- Provide instrumentation and operational support for building and operating the monitoring, logging, and alerting services of distributed systems at scale.
- Manage and interpret large datasets using query languages and visualization tools.
- Ensure familiarity with ITIL processes such as incident management, change management and problem management.
- Experience with on-prem and cloud environments (AWS and Azure), including building and operating highly resilient platforms in public clouds; migration experience.
- Experience with container orchestration (preferably Kubernetes) and DevOps concepts including CI/CD pipelines.
- Hands-on experience with observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
- Use Datadog, Catchpoint, Splunk & Grafana for application observability and monitoring of app & infrastructure.
- Experience with infrastructure as code tools (IAM, ARM, Terraform, Chef).
- Handle a large fleet of on-prem servers (including security & patching oversight) and manage hundreds of SSL certificates for all applications in scope.
- Proven experience performing chaos testing to build confidence in the system's ability to withstand turbulent conditions in production.
- Strong communication skills to reach both technical and non-technical audiences.
- Bachelor’s degree or higher, or equivalent experience, in a technology-related field (e.g., Engineering, Computer Science); Master’s degree a plus.
- 5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale.
- Hands-on experience with public cloud environments, preferably AWS and Azure; certifications a plus.
- Exposure to basic OS-level scripting languages (Korn/Bash/JavaScript).
- On-call experience running incidents.
- Experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
The team comes from diverse technical backgrounds, and the responsibilities provide the opportunity for a variety of challenges. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other or previous experience as an SRE.
The Role
You will have the opportunity to lead all aspects of production support - readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization. The role offers many opportunities to broaden your knowledge across multiple dimensions of technology while retaining a key focus on Cloud Computing (AWS & Azure) and enterprise tools and solutions such as Jenkins, uDeploy, Docker, Kubernetes, Splunk, and Datadog.
This role helps provide a truly predictable customer experience. In times of market volatility and high volumes, there is an increased expectation of a consistent service level. We strive to meet this expectation by building reliability into our ecosystem. This will be achieved through defining & implementing practices in Resiliency Engineering, Automation, Observability & Chaos Testing while also ingraining a proactive culture of reliability-first design.