Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE)
Listed on 2026-06-03
IT/Tech
Cloud Computing, SRE/Site Reliability
Overview
Title: Lead Site Reliability Engineer (SRE) / Principal Site Reliability Engineer (SRE)
Location: Irving, TX & Charlotte, NC - Hybrid Role
Duration: 18+ months, contract-to-hire or with possible extension
We are seeking a Senior Site Reliability Engineer (SRE) with a strong background in software engineering and a passion for solving complex problems. This role blends software engineering with operational expertise to deliver stable, scalable, and resilient services, while reducing toil and shifting operations left.
The team runs support for Shared Services Operations Technology, split among Payment Evaluations, Regulatory Operations, Financial Crimes, and Business and Real Estate Evaluation. It supports systems that perform KYC and AML checks for financial crimes. The team supports about 85 apps, about 75 of which have no SLOs or SLIs, so they'd like those defined. The team is also getting into automation with RPA and chatbots.
The team hopes to find someone who could work in any one of these domains. The org has a high volume of tickets, but this person would be expected to work more proactively on projects. Right now, that person may be "firefighting" 60% of the time and doing prevention the other 40%, but the team would like to improve to 80% prevention.
OCP is highly preferable for cloud experience since it's being implemented across the organization.
This role backfills an FTE position with someone they'd like to try out. Some weekends may require system support, and overtime is an occasional possibility. Weekend work may come up once every month or two on a rotation, depending on whether they're assigned to that rotation as an SRE.
Key Responsibilities
- Design and implement automated tooling to eliminate manual toil and optimize operations.
- Build and enhance monitoring, alerting and overall observability.
- Champion the SRE practice within COO Technology by modeling best practices, mentoring peers, and collaborating with embedded platform SRE teams.
- Enhance system availability in a multi-cloud environment by evolving resiliency patterns.
- Introduce and scale AIOps, including self-healing and autonomic systems using AI/ML, RPA, and unified communications.
- Automate key SRE metrics and IT service operations processes, including customer impact analysis, availability tracking, SLO/SLI adherence, error budgeting, and incident response.
- Support critical applications and customer journeys, lead Agile-based remediation efforts, conduct blameless postmortems, and drive root cause analysis to eliminate recurring issues.
- Implement Non-Functional Requirements (NFRs) and guide teams through them during modernization and uplift initiatives.
- Help define, govern and enforce Permit to Operate.
Qualifications
- 8+ years of SRE experience
- Database knowledge
- Observability tools
- Autosys
- A good SRE will likely be interested in AI
- Expertise in Linux and container platforms (Kubernetes)
- Experience with cloud platforms: PCF, AWS, GCP, or Azure
- Data platforms: Oracle, DB2, SQL, MongoDB, Hadoop, Cloudera, Spark, Teradata
Mindlance is an Equal Opportunity Employer and does not discriminate in employment on the basis of Minority/Gender/Disability/Religion/LGBTQI/Age/Veterans.