Site Reliability Engineer (SRE) - AI Platform & Cloud Job, Alpharetta area, Georgia, USA, IT/Tech

Position: Site Reliability Engineer (SRE) - AI Platform & Cloud

In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities.

This is a Software Engineering position at Director level, which is part of the job family responsible for developing and maintaining software solutions that support business needs.

Since 1935, Morgan Stanley has been known as a global leader in financial services, always evolving and innovating to better serve our clients and our communities in more than 40 countries around the world.

Our mission is to develop a firmwide Artificial Intelligence (AI) Development Platform that aligns with the firm’s Technology principles; drives efficiency, consistency, controls, security, and strong governance; and promotes innovation, enabling teams to build applications that leverage AI capabilities and accelerate the adoption of AI across our businesses.

This role is for an experienced and driven Site Reliability Engineer (SRE) to join our AI Platform team and help support, scale, and harden the infrastructure that powers our AI/ML systems. You will collaborate closely with infrastructure engineering, cloud engineering, data engineering, and security teams to ensure the availability, reliability, performance, and security of production AI workloads (training, inference, data pipelines) in a regulated, high-stakes financial environment.

As an SRE on the AI platform, you will bring deep operations, automation, and systems engineering skills to enable our models and pipelines to run reliably at scale, while balancing cost, security, and compliance constraints.

The ideal candidate will have strong hands-on experience supporting software platforms built on any combination of the following: Kubernetes, cloud (AWS, Azure, and/or Google), API-based development, REST frameworks, data engineering, and large-scale API gateway environments. Knowledge of AI/ML and hands-on experience implementing solutions using Generative AI are also preferred. The candidate will have great communication skills, a team-based mentality, and a strong passion for using AI to increase productivity and to help generate new ideas for product and technical improvements.

What you'll do in the role:

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
Optimize cost vs. performance tradeoffs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
Maintain runbooks, operational playbooks, documentation, and training materials
Participate in on-call rotations and respond to production incidents 24/7 as needed
Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

What you'll bring to the role:

Bachelor’s or Master’s degree in Computer Science or related field, or equivalent job experience
5 years of production experience in SRE, infrastructure, or operations roles for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
Experience with infrastructure-as-code tools (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience…