Cloud Evals Infrastructure Engineer Job, Berkeley area, California, USA, IT/Tech

METR is looking for an infrastructure engineer to manage our cloud services, notably the deployment of Inspect, the open-source LLM eval tooling, and Hawk, our cloud-native wrapper around it.

About METR

METR is a non-profit that conducts empirical research to determine whether frontier AI models pose a significant threat to humanity. It is robustly good for civilization to have a clear understanding of what types of danger AI systems pose and to know how high the risk is. You can learn more about our goals from our published talks (overall goals, recent update).

Some highlights of our work so far:

Establishing autonomous replication evals: Thanks to our work, it is now taken for granted that autonomous replication (the ability of a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for.

Pre-release evaluations:
We’ve worked with OpenAI and Anthropic to evaluate their models pre‑release, and our research has been widely cited by policymakers, AI labs, and within government.

Inspiring lab evaluation efforts:
Multiple leading AI companies are building their own internal evaluation teams, inspired by our work.

Early commitments from labs:
The safety frameworks of Google DeepMind, OpenAI, and Anthropic all credit or endorse our work in developing responsible scaling policies.

We have been mentioned by the UK government, Time Magazine, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.

Required Qualifications

Minimum of eight years of professional experience working with cloud infrastructure
Demonstrated expertise with AWS services, in particular non-trivial IAM configurations, EKS, ECS, Lambda, CloudWatch, and RDS Aurora
Python development skills
Infrastructure as Code experience: Terraform, CDK, or Pulumi
CI/CD workflows, e.g. GitHub Actions
Proven experience in systems administration, with strong knowledge of user administration on Linux systems (user creation, SSH access, etc.)
Experience managing and integrating various SaaS platforms and identity management systems

Key Responsibilities

Manage our cloud infrastructure (AWS with Terraform and Pulumi) and non‑infrastructure service providers (external GPU providers, LLM inference providers)
Implement, and proactively help team members implement, best practices for the use of containerization services (Docker, Kubernetes), including NVIDIA GPUs on AWS (via the NVIDIA Container Toolkit)
Manage our deployment processes (Terraform, Pulumi, GitHub Actions)
Manage our networking infrastructure (Tailscale, Cilium, AWS VPC) and make adjustments as needed to enforce security restrictions and implement research‑driven requests
Advise on and implement best practices to increase the scalability, reliability, and cost-effectiveness of our systems (on the order of many thousands of concurrently running containers)
Opportunities to advise on and/or help implement our growing data pipelines
Keep up to date on industry trends and best practices for organizational practices involving infrastructure, including but not limited to IaC, CI/CD, serverless stacks, and event-driven frameworks
Contribute to infrastructure observability and monitoring (CloudWatch, Datadog)
Proactively improve our architecture, internal/public workflows, and security policies
Share responsibility for some IT tasks (MDM, Okta, Google Workspace, SSO)
Manage user access and permissions across multiple platforms (AWS, Google Workspace, GitHub, Tailscale, Auth0)
Streamline new hire onboarding and access management processes
Serve as the primary point of contact for technical support: build playbooks to resolve common issues, and escalate to other internal teams or external support where needed
Collaborate with security consultants and internal teams to maintain and enhance security protocols

Nice to Haves

Background in supporting researchers and software engineers
Familiarity with the wacky world of AI safety
Deeper knowledge of LLMs than your average engineer
Knowledge of security best practices and compliance requirements (e.g. SOC 2)
Pulumi IaC with Python
Data engineering skills, e.g.…