Member of Technical Staff, ML Infra, AGI
Listed on 2025-12-28
IT/Tech
Cloud Computing, Systems Engineer
Job | Services LLC - A57
We’re looking for a driven and talented Member of Technical Staff to join our AGI Autonomy organization and build state‑of‑the‑art agents.
Our lab is a small, talent‑dense team with the resources and scale of Amazon. Each team has the autonomy to move fast and a long‑term commitment to pursue high‑risk, high‑payoff research. We’re entering an exciting new era where agents can redefine what AI makes possible. We’d love for you to join us and build it from the ground up!
Key Job Responsibilities
- Design, build, and maintain the compute platform that powers all AI research at the SF AI Lab, managing large‑scale GPU pools and ensuring optimal resource utilization.
- Partner directly with research scientists to understand experimental requirements and develop infrastructure solutions that accelerate research velocity.
- Implement and maintain robust security controls and hardening measures while enabling researcher productivity and flexibility.
- Modernize and scale existing infrastructure by converting manual deployments into reproducible Infrastructure as Code using AWS CDK.
- Optimize system performance across multiple GPU architectures, becoming an expert in extracting maximum computational efficiency.
- Design and implement monitoring, orchestration, and automation solutions for GPU workloads at scale.
- Ensure infrastructure is compliant with Amazon security standards while creatively solving for research‑specific requirements.
- Collaborate with AWS teams to leverage and influence cloud services that support AI workloads.
- Build distributed systems infrastructure, including Kubernetes‑based orchestration, to support multi‑tenant research environments.
- Serve as the bridge between traditional systems engineering and ML infrastructure, bringing enterprise‑grade reliability to research computing.
This role is part of the foundational infrastructure team at the SF AI Lab, responsible for the platform that enables all research across the organization. Our team serves as the critical link between Amazon’s enterprise infrastructure and the Lab’s research needs. We are experts in performance optimization, systems architecture, and creative problem‑solving—finding ways to push the boundaries of what’s possible while maintaining security and reliability standards.
We work closely with research scientists, understanding their experimental needs and translating them into robust, scalable infrastructure solutions. Our team has deep expertise in ML framework internals and GPU optimization, but we’re also pragmatic systems engineers who build traditional infrastructure with enterprise‑grade quality. We value engineers who can balance research velocity with operational excellence, who bring curiosity about ML while maintaining strong fundamentals in systems engineering.
Basic Qualifications
- 5+ years of professional experience in systems development, DevOps, or infrastructure engineering.
- Hands‑on experience with AWS services and cloud infrastructure (EC2, VPC, S3, IAM, CloudFormation/CDK, etc.).
- Programming skills in Python, Go, or similar languages for infrastructure automation.
- Experience building and maintaining production systems at scale.
- Demonstrated ability to troubleshoot complex distributed systems issues.
- Knowledge of security best practices and experience implementing security controls.
- Experience with Infrastructure as Code (IaC) principles and tools.
- Knowledge of AWS CDK and CloudFormation for infrastructure automation.
- Networking experience (VPC design, network security, performance optimization).
- Security hardening experience in cloud environments, including compliance frameworks.
- Experience with Kubernetes and container orchestration at scale.
- Familiarity with GPU computing, CUDA, and ML framework internals (PyTorch, TensorFlow, Ray).
Amazon is an equal‑opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Los Angeles County applicants:
Job duties for this position include working safely and cooperatively with other employees, supervisors, and…