Senior Machine Learning Research Engineer / Research Scientist (RE/RS)
Berkeley / Engineering & Research / Employee / Hybrid
We are offering a $21k referral bonus for this role. You can refer candidates through our referral form, which lists the terms of the bonus.
METR assesses AI capabilities, risks, and mitigations, with a specific focus on threats related to autonomy, AI R&D automation, and alignment. This primarily means advancing the science of AI measurement, with a particular focus on understanding frontier AI systems’ ability to complete complex tasks without human input. It also involves executing those measurements ourselves, directly assessing the capabilities of frontier AI systems to inform risk assessments and build consensus within the AI industry, among policymakers, and with the public.
About METR
METR is a non-profit that conducts empirical research to determine whether frontier AI models pose a significant threat to society. It is robustly good for civilization to have a clear understanding of what types of danger AI systems pose and how high the risks are. You can learn more about our goals from (for example) this podcast episode.
Some highlights of our work so far:
- Measuring AI ability to complete long tasks: We proposed measuring AI performance in terms of the length of tasks AI agents can complete. We showed that this metric has been increasing exponentially over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks (a toy version of this extrapolation is sketched after this list).
- Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: We conducted a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we found that when developers used AI tools, they took 19% longer than without: AI made them slower.
- Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and governments.
- Inspiring AI developers to work on AI risk management: We helped pioneer the ideas behind Frontier AI Safety Policies, a now-standard approach that uses dangerous-capability evaluations to govern the risk from frontier AI systems. Several prominent AI developers have cited our assistance or contributions in creating their frameworks, including Anthropic, Google DeepMind, Magic, Amazon, and G42.
- Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability of a model to independently copy itself to different servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate this capability as part of the White House commitments.
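To make the long-tasks extrapolation concrete, here is a minimal sketch of compounding growth under a fixed doubling time. The ~7-month doubling time is the trend reported above; the 2025 baseline of a 1-hour task horizon is a hypothetical starting point for illustration, not METR's actual fitted value:

```python
# Toy extrapolation of the task-length metric.
# Assumption: the length of tasks agents can complete doubles every
# ~7 months (the reported trend). The 1-hour horizon in 2025 is a
# hypothetical baseline, not a fitted value.
DOUBLING_TIME_MONTHS = 7
BASELINE_YEAR = 2025
BASELINE_HORIZON_HOURS = 1.0

def horizon_hours(year: float) -> float:
    """Extrapolated task horizon (in hours) for a given year."""
    months_elapsed = (year - BASELINE_YEAR) * 12
    return BASELINE_HORIZON_HOURS * 2 ** (months_elapsed / DOUBLING_TIME_MONTHS)

for year in range(2025, 2036):
    print(f"{year}: ~{horizon_hours(year):,.0f} hours")
# Under these toy numbers, the horizon passes a 40-hour work week
# around 2028 and reaches multi-week tasks well within a decade.
```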
Our work has been cited by NIST, a US President, the UK Government, Nature News, The New York Times, Time Magazine, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.
Logistics
Compensation: $340,000–$450,000 USD salary range, plus employee benefits. For very experienced, exceptional researchers, we are open to exploring compensation well above this stated range.
Hybrid Requirements: Most of our technical team members work from our office in Berkeley, CA four days a week. We will make exceptions for exceptionally strong candidates.
We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in person (preferred), we can likely sponsor a cap-exempt H-1B visa for this role.
We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.
Responsibilities
These are relevant examples; we don't expect every candidate to do all of them:
- Run pre-deployment evaluations on frontier models, sometimes in collaboration with partners in governments or frontier AI companies.
- Design and implement autonomy tasks/benchmarks.
- Rapidly execute experiments to elicit agent capabilities on complex tasks.
- Study the effects of inference compute scaling on agent capabilities.
- Design economic models for AI R&D acceleration and recursive self-improvement, and collect evidence that informs key parameters.
- Conduct RCTs measuring the impacts of AI systems on AI R&D productivity.
- Our technical team is currently small (~20 people), so you will have an opportunity to shape our future direction.
What We're Looking For
- A strong ML publication record (e.g., several first-author papers at leading ML conferences).
- Experience working on post-training or agent elicitation teams at frontier AI labs.
- We're especially interested in folks with research and/or people management experience.
- Strong data science and statistics skills are a plus.
- Alignment with our team—as a mission-driven organization, we're particularly interested in candidates who are excited about understanding AI agent capabilities and evaluating safety/control methodologies.