Lead,SRE Job Lead South Dakota USA,IT/Tech

Location: Lead

Job Summary

Design, implement, and maintain scalable, reliable, and highly available systems.
Develop and maintain automation tools for infrastructure provisioning, monitoring, and incident response.
Collaborate with development teams to improve system operability and ensure reliability best practices are followed.
Monitor system performance, identify bottlenecks, and implement solutions to improve reliability and scalability.
Troubleshoot production issues, perform root cause analysis, and implement fixes to prevent recurrence.
Collaborate with the DevOps team to build and maintain CI/CD pipelines that automate deployments and testing.
Implement and manage monitoring, alerting, and logging solutions using tools like Prometheus, Grafana, and Loki.
Ensure systems are secure and compliant with organizational policies and standards.
Conduct post-incident reviews and drive improvements to reduce mean time to recovery (MTTR).
Champion a culture of reliability, automation, and continuous improvement within the team.
To be successful, the candidate will have a strong understanding of system reliability principles and will work to achieve the related business objectives:
Effectively manage themselves and their tasks during the project lifecycle.
Identify and mitigate risks that could impact system reliability and availability.
Engage with multiple stakeholders and vendors to ensure alignment on reliability goals.

Key Responsibilities

Strategy

8+ years of experience in Site Reliability Engineering or DevOps, with a strong focus on automation, monitoring, and system reliability.

Business

Strong experience in designing and implementing scalable, reliable, and fault-tolerant systems.
Proficient in infrastructure automation tools like Terraform, Ansible, or equivalent.
Hands‑on experience with CI/CD tools like Jenkins, Azure DevOps (ADO), or GitLab CI/CD.
Strong knowledge of monitoring and observability tools such as Prometheus, Grafana, Loki, or equivalent.
Proficient in scripting and automation using Python, Bash, or similar languages.
Experience with containerization (Docker, Podman) and orchestration platforms (Kubernetes).
Strong understanding of cloud platforms (AWS, Azure, or GCP) and infrastructure as code (IaC) principles.
Experience in troubleshooting and optimizing Linux‑based systems.
Hands‑on experience in setting up and managing logging and alerting systems.
Experience in conducting post‑incident reviews and implementing reliability improvements.
Familiarity with security best practices and compliance standards.

Desired skills (good to have)

Exposure to Generative AI and knowledge/experience in implementing AI solutions for system reliability.
Experience with chaos engineering tools to test system resilience.
Knowledge of database performance tuning and optimization.
Experience with service mesh technologies like Istio or Linkerd.

Processes

People & Talent

Risk Management

Governance

Responsible for assessing the effectiveness of the Group's arrangements to deliver effective governance, oversight and controls in the business and, if necessary, overseeing changes in these areas.
Awareness and understanding of the regulatory framework in which the Group operates, and of the regulatory requirements and expectations relevant to the role.

Regulatory & Business Conduct

Display exemplary conduct and live by the Group’s Values and Code of Conduct.
Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct, across Standard Chartered Bank. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.
Effectively and collaboratively identify, escalate, mitigate and resolve risk, conduct and compliance matters.

Skills and Experience

