Lead, SRE
Job in Lead, Lawrence County, South Dakota, 57754, USA
Listed on 2026-02-21
Listing for: Standard Chartered
Full Time position
Job specializations:
- IT/Tech
- SRE/Site Reliability, Cloud Computing, Systems Engineer, IT Project Manager
Job Description & How to Apply Below
Job Summary
- Design, implement, and maintain scalable, reliable, and highly available systems.
- Develop and maintain automation tools for infrastructure provisioning, monitoring, and incident response.
- Collaborate with development teams to improve system operability and ensure reliability best practices are followed.
- Monitor system performance, identify bottlenecks, and implement solutions to improve reliability and scalability.
- Troubleshoot production issues, perform root cause analysis, and implement fixes to prevent recurrence.
- Collaborate with the DevOps team to build and maintain CI/CD pipelines that automate deployments and testing.
- Implement and manage monitoring, alerting, and logging solutions using tools like Prometheus, Grafana, and Loki.
- Ensure systems are secure and compliant with organizational policies and standards.
- Conduct post-incident reviews and drive improvements to reduce mean time to recovery (MTTR).
- Champion a culture of reliability, automation, and continuous improvement within the team.
- To be successful, the candidate will have a strong understanding of system reliability principles and will work to achieve the related business objectives:
- Effectively manage themselves and their tasks during the project lifecycle.
- Identify and mitigate risks that could impact system reliability and availability.
- Engage with multiple stakeholders and vendors to ensure alignment on reliability goals.
- 8+ years of experience in Site Reliability Engineering or DevOps, with a strong focus on automation, monitoring, and system reliability.
- Strong experience in designing and implementing scalable, reliable, and fault-tolerant systems.
- Proficient in infrastructure automation tools like Terraform, Ansible, or equivalent.
- Hands‑on experience with CI/CD tools like Jenkins, Azure DevOps (ADO), or GitLab CI/CD.
- Strong knowledge of monitoring and observability tools such as Prometheus, Grafana, Loki, or equivalent.
- Proficient in scripting and automation using Python, Bash, or similar languages.
- Experience with containerization (Docker, Podman) and orchestration platforms (Kubernetes).
- Strong understanding of cloud platforms (AWS, Azure, or GCP) and infrastructure as code (IaC) principles.
- Experience in troubleshooting and optimizing Linux‑based systems.
- Hands‑on experience in setting up and managing logging and alerting systems.
- Experience in conducting post‑incident reviews and implementing reliability improvements.
- Familiarity with security best practices and compliance standards.
- Exposure to Generative AI and knowledge/experience in implementing AI solutions for system reliability.
- Experience with chaos engineering tools to test system resilience.
- Knowledge of database performance tuning and optimization.
- Experience with service mesh technologies like Istio or Linkerd.
- Responsible for implementing end-to-end SRE solutions.
- This role is not a people management role.
- Responsible for assessing the effectiveness of the Group's arrangements to deliver effective governance, oversight and controls in the business and, if necessary, overseeing changes in these areas.
- Awareness and understanding of the regulatory framework in which the Group operates, and of the regulatory requirements and expectations relevant to the role.
- Display exemplary conduct and live by the Group’s Values and Code of Conduct.
- Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.
- Effectively and collaboratively identify, elevate, mitigate and resolve risk, conduct and compliance matters.
- Site Reliability Engineering
- Infrastructure Automation (Terraform, Ansible)
- CI/CD Tools (Jenkins, Azure DevOps)
- Monitoring and Observability (Prometheus, Grafana, Loki)
- Java and/or Python
- Linux System Administration
- Scripting…