Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE) Job, Irving area, Texas, USA, IT/Tech

Position: Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE)

Overview

Title: Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE)

Location: Irving, TX & Charlotte, NC - Hybrid Role

Duration: 18+ months, contract-to-hire or possible extension

We are seeking a Senior Site Reliability Engineer (SRE) with a strong background in software engineering and a passion for solving complex problems. This role blends software engineering with operational expertise to deliver stable, scalable, and resilient services, while reducing toil and shifting operations left.

The team runs support for Shared Services Operations Technology, split among Payment Evaluations, Regulatory Operations, Financial Crimes, and Business and Real Estate Evaluation. It supports systems that perform KYC and AML checks for financial crimes. The team supports about 85 apps, about 75 of which have no SLOs or SLIs, so they'd like those defined. The team is also getting into automation with RPA and chatbots.

The team hopes to find someone who could work in any one of these domains. Ticket volume in the org is high, but this person would be expected to work more proactively on projects. Right now, that person may be "firefighting" 60% of the time and doing prevention the other 40%, but the team would like to improve to 80% prevention.

OCP is highly preferred for cloud experience, since it's being implemented across the organization.

This role backfills an FTE with someone they'd like to try out. Some weekends may require system support, and overtime is an occasional possibility. Weekend work may fall once every month or two on a rotation, depending on whether they're assigned to that rotation as an SRE.

Key Responsibilities

Design and implement automated tooling to eliminate manual toil and optimize operations.
Build and enhance monitoring, alerting and overall observability.
Champion the SRE practice within COO Technology by modeling best practices, mentoring peers, and collaborating with embedded platform SRE teams.
Enhance system availability in a multi-cloud environment by evolving resiliency patterns.
Introduce and scale AIOps, including self-healing and autonomic systems using AI/ML, RPA, and unified communications.
Automate key SRE metrics and IT service operations processes, including customer impact analysis, availability tracking, SLO/SLI adherence, error budgeting, and incident response.
Support critical applications and customer journeys, lead Agile-based remediation efforts, conduct blameless postmortems, and drive root cause analysis to eliminate recurring issues.
Implement Non-Functional Requirements (NFRs) and guide teams through them during modernization and uplift initiatives.
Help define, govern and enforce Permit to Operate.

Top Skills

8+ years minimum SRE experience
Database knowledge
Observability tools

Nice to Have

Autosys
Interest in AI, which a good SRE will likely have

Infrastructure & Cloud

Expertise in Linux and container platforms (Kubernetes)
Experience with cloud platforms: PCF, AWS, GCP, or Azure

CI/CD & Automation

Observability & AIOps

Operations & Data

Data platforms: Oracle, DB2, SQL, MongoDB, Hadoop, Cloudera, Spark, Teradata

EEO

Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans.
