Senior Machine Learning Research Engineer/Scientist
Berkeley / Engineering & Research / Employee / Hybrid
We are offering a $21k referral bonus for this role. You can refer candidates through our referral form, which lists the terms of the bonus.
METR assesses AI capabilities, risks, and mitigations, with a specific focus on threats related to autonomy, AI R&D automation, and alignment. This primarily means advancing the science of AI measurement, with a particular focus on understanding frontier AI systems’ ability to complete complex tasks without human input. It also means executing those measurements, directly assessing the capabilities of frontier AI systems, to inform risk assessments and build consensus within the AI industry, among policymakers, and with the public.
About METR
METR is a non-profit that conducts empirical research to determine whether frontier AI models pose a significant threat to society. It is robustly good for civilization to have a clear understanding of what types of danger AI systems pose and how high the risk is. You can learn more about our goals from our published talks (late 2023 update).
Some highlights of our work so far:
- Measuring AI ability to complete long tasks: We proposed measuring AI performance in terms of the length of tasks AI agents can complete, and showed that this metric has increased exponentially and consistently over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks (a back-of-the-envelope sketch of this extrapolation follows this list).
- Pre-release evaluations: We’ve worked with OpenAI and Anthropic to evaluate their models pre-release, and our research has been widely cited by policymakers, AI labs, and governments.
- Inspiring AI developers to work on AI risk management: We helped pioneer the ideas behind Frontier AI Safety Policies, now a standard approach for using dangerous-capability evaluations to govern the risk from frontier AI systems. Several prominent AI developers have cited our assistance or contributions in creating their frameworks, including Anthropic, Google DeepMind, Magic, Amazon, and G42.
- Establishing autonomous replication evals: Thanks to our work, it’s now taken for granted that autonomous replication (the ability of a model to independently copy itself to other servers, obtain more GPUs, etc.) should be tested for. For example, labs pledged to evaluate for this capability as part of the White House commitments.
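To make the long-task extrapolation above concrete, here is a minimal back-of-the-envelope sketch. The ~7-month doubling time is the trend described in the bullet; the starting task length and the "days or weeks" target are illustrative assumptions, not METR figures.

```python
import math

DOUBLING_TIME_MONTHS = 7      # observed doubling time of the task-length metric
current_horizon_hours = 1.0   # assumed: length of tasks agents can complete today (~1 hour)
target_horizon_hours = 80.0   # assumed: "days or weeks" of human work (~2 work weeks)

# Under exponential growth, horizon(t) = horizon(0) * 2 ** (t / doubling_time),
# so the time to reach the target is doubling_time * log2(target / current).
doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
months_needed = DOUBLING_TIME_MONTHS * doublings_needed

print(f"{doublings_needed:.1f} doublings -> ~{months_needed / 12:.1f} years")
# ~6.3 doublings -> ~3.7 years. Even if the assumed starting horizon is off
# by an order of magnitude, the target is still reached in under a decade.
```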
Our work has been cited by NIST, a US President, the UK Government, Nature News, The New York Times, Time Magazine, and others. We’re sufficiently connected to relevant parties (labs, governments, and academia) that any good work we do or insights we uncover can quickly be leveraged.
Logistics
Compensation: $340,000–$450,000 USD salary range plus employee benefits. For exceptionally experienced researchers, we are open to exploring compensation well above this stated range.
We encourage you to apply even if your background may not seem like the perfect fit! We would rather review a larger pool of applications than risk missing out on a promising candidate for the position. If you lack US work authorization and would like to work in-person (preferred), we can likely sponsor a cap-exempt H-1B visa for this role.
We are committed to diversity and equal opportunity in all aspects of our hiring process. We do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. We welcome and encourage all qualified candidates to apply for our open positions.
Responsibilities
These are core examples, but we don’t expect every candidate to do all of them:
- Design tasks/benchmarks and other evaluation methodologies that can determine if a model is capable of causing or enabling large-scale catastrophic harm.
- Run pre-deployment evaluations on frontier models, sometimes in collaboration with partners in governments or frontier AI companies.
- Rapidly execute experiments to determine how different elicitation techniques affect results.
- Think carefully about principled approaches to measuring AI capabilities, and run experiments to evaluate them.
- Study the effects of inference compute scaling on agent capabilities.
- Design economic models for AI R&D acceleration and recursive self-improvement, and collect evidence that informs key parameters.
- In addition to creating AI R&D evals, METR is interested in developing the "science of evals" and standards for alignment.
- Our technical team is currently small (~20 people), so you will have an opportunity to shape our future direction.
What We're Looking For
- A strong ML publication record (e.g., several first-author papers at leading ML conferences),
- Experience working directly on scaling/pretraining or post-training teams at frontier AI labs, or
- Multiple years of experience solving challenging ML engineering or research problems at frontier AI labs.
- We're especially interested in folks with research and/or people management experience.
- Strong data science and statistics skills are a plus.
- Alignment with our team: as a mission-driven organization, we’re particularly interested in candidates who are excited about understanding AI agent capabilities and evaluating safety/control methodologies.