Site Reliability Engineer SINDC Job, Denton area, Texas, USA, IT/Tech

Position: Site Reliability Engineer -- SINDC5717546

Overview

We are currently sourcing a Site Reliability Engineer to work in the Client's Enterprise Infrastructure Group in Westlake, TX or Merrimack, NH.

Shift: On-call days, 10 am – 8 pm EST; other days, M-F 9:00 am – 5:00 pm. On call twice a week; one of those days may fall on a weekend.

Responsibilities

Provide enterprise Cloud and Platform Engineering support for production environments and participate in on-call rotation to provide solutions.
Lead all aspects of production support—readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization.
Define & implement practices in resiliency engineering, automation, observability & chaos testing to improve system reliability.
Solve stack-wide engineering issues related to hardware, software, network, applications, and cloud service providers.
Balance delivery with ad hoc workloads and re-evaluate priorities as needed.
Perform triage, root cause analysis, and decisive problem solving under pressure.
Maintain scalability and resiliency of a complex environment; implement advanced observability practices and techniques at scale.
Provide instrumentation and operations for building and operating monitoring, logging, and alerting services for distributed systems at scale.
Manage and interpret large datasets using query languages and visualization tools.
Ensure familiarity with ITIL processes such as incident management, change management and problem management.
Experience with on-prem and cloud environments (AWS and Azure), including building and operating highly resilient platforms in public clouds; migration experience.
Experience with container orchestration (preferably Kubernetes) and DevOps concepts including CI/CD pipelines.
Hands-on experience with observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).
Use Datadog, Catchpoint, Splunk & Grafana for application observability and monitoring of app & infrastructure.
Experience with infrastructure as code tools (IAM, ARM, Terraform, Chef).
Handle a large fleet of on-prem servers (including security & patching oversight) and manage hundreds of SSL certificates for all applications in scope.
Proven experience performing chaos testing to build confidence in the system's ability to withstand turbulent conditions in production.
Strong communication skills to reach both technical and non-technical audiences.

Qualifications

Bachelor’s degree or higher, or equivalent experience, in a technology-related field (e.g., Engineering, Computer Science); Master’s degree a plus.
5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale.
Hands-on experience with public cloud environments, preferably AWS and Azure; certifications a plus.
Exposure to basic OS-level scripting languages (Korn/Bash/JavaScript).
On-call experience running incidents.
Experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.).

The Team

The team comes from diverse technical backgrounds, and the responsibilities provide the opportunity for a variety of challenges. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other, or previous experience as an SRE.

The Role

You will have the opportunity to lead all aspects of production support: readiness, availability and resiliency of critical Applications, Batches & Infrastructure representing various business units while being centrally aligned to the Production Services organization. The role offers many opportunities to broaden knowledge across multiple dimensions of technology while retaining a key focus on Cloud Computing (AWS & Azure) and enterprise tools/solutions like Jenkins, uDeploy, Docker, Kubernetes, Splunk, Datadog, etc.

This role helps provide a truly predictable customer experience. In times of market volatility and high volumes, there is an increased expectation of a consistent service level. We strive to meet this expectation by building reliability into our ecosystem. This will be achieved through defining & implementing practices in Resiliency Engineering, Automation, Observability & Chaos Testing while also ingraining a proactive culture that prioritizes reliability-first design.
