Site Reliability Engineer SINDC

Job in Denton, Denton County, Texas, 76205, USA
Listing for: Compunnel Inc.
Full Time position
Listed on 2026-02-07
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Job Description & How to Apply Below
Position: Site Reliability Engineer -- SINDC5717546

Overview

We are seeking a Site Reliability Engineer to work in the client's Enterprise Infrastructure Group in Westlake, TX or Merrimack, NH.

Shift: 10 am – 8 pm EST when on call; Monday to Friday, 9:00 am – 5:00 pm when not on call. On-call duty falls twice a week, and one of those days may be on a weekend.

Responsibilities
  • Provide enterprise Cloud and Platform Engineering support for production environments and participate in on-call rotation to provide solutions.
  • Lead all aspects of production support—readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization.
  • Define & implement practices in resiliency engineering, automation, observability & chaos testing to improve system reliability.
  • Solve stack-wide engineering issues related to hardware, software, network, applications, and cloud service providers.
  • Balance delivery with ad hoc workloads and re-evaluate priorities as needed.
  • Perform triage, root cause analysis, and decisive problem solving under pressure.
  • Maintain scalability and resiliency of a complex environment; implement advanced observability practices and techniques at scale.
  • Instrument distributed systems and build and operate their monitoring, logging, and alerting services at scale.
  • Manage and interpret large datasets using query languages and visualization tools.
  • Familiarity with ITIL processes such as incident management, change management, and problem management.
  • Experience with on-prem and cloud environments (AWS and Azure), including building and operating highly resilient platforms in public clouds; migration experience.
  • Experience with container orchestration (preferably Kubernetes) and DevOps concepts, including CI/CD pipelines.
  • Hands-on experience with observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
  • Use Datadog, Catchpoint, Splunk & Grafana for application observability and monitoring of app & infrastructure.
  • Experience with infrastructure as code tools (IAM, ARM, Terraform, Chef).
  • Handle a large fleet of on-prem servers (including security & patching oversight) and manage hundreds of SSL certificates for all applications in scope.
  • Proven experience performing chaos testing to build confidence in the system's ability to withstand turbulent conditions in production.
  • Strong communication skills to reach both technical and non-technical audiences.
Qualifications
  • Bachelor’s degree or higher in a technology-related field (e.g., Engineering, Computer Science), or equivalent experience; Master’s degree a plus.
  • 5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale.
  • Hands-on experience with public cloud environments, preferably AWS and Azure; certifications a plus.
  • Exposure to basic OS-level scripting languages (Korn/Bash/JavaScript).
  • On-call experience running incidents.
  • Experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
The Team

The team comes from diverse technical backgrounds, and the responsibilities provide the opportunity for a variety of challenges. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other or previous experience as an SRE.

The Role

You will have the opportunity to lead all aspects of production support - readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization. The role offers many opportunities to broaden your knowledge across multiple dimensions of technology while keeping a key focus on cloud computing (AWS & Azure) and enterprise tools and solutions such as Jenkins, uDeploy, Docker, Kubernetes, Splunk, and Datadog.

This role helps provide a truly predictable customer experience. In times of market volatility and high volumes, there is an increased expectation of a consistent service level. We strive to meet this expectation by building reliability into our ecosystem. This will be achieved through defining and implementing practices in Resiliency Engineering, Automation, Observability & Chaos Testing, while also ingraining a proactive culture of reliability-first design.
