Evals Demonstration Engineer (Contract)
London
Evals Team / Contract - Temp / On-site
Application deadline: 31 Oct 2025. We review applications on a rolling basis and encourage early submissions.
The Opportunity 
We're seeking an Evals Demonstration Engineer who will design new demonstrations and translate our technical evaluation findings into compelling and accessible demonstrations for policymakers and non-technical stakeholders (e.g. government officials, think tanks).
This role offers the opportunity to shape how frontier AI risks and our work are communicated to those with the power to address them. You will also act as the bridge between Apollo’s internal technical research and governance teams. This role requires a unique mix of technical understanding, policy acumen, creativity and communication skills. 
This is a 6-month, full-time contract role. Depending on your performance and Apollo's needs, it might transition into a permanent role.
Key Responsibilities
- Build a library of reusable self-contained demos using the Inspect framework and our internal code base
- Develop interactive and effective demonstrations in the right medium (web-based and interactive, visual, video, report etc.) that clearly communicate our evaluation findings and the AI capabilities and risks that we care about
- Here are examples of demos/visuals that we generally like: Anthropic Interpretability, Anthropic persona vectors, 3Blue1Brown, AI2027 video, OWID graphs, EPOCH graphs
- The visualizations will likely mostly revolve around transcripts rather than graphs. For example, these may look like better versions of our in-context scheming snippets
- Deliver live demonstrations and presentations to policymakers and non-technical stakeholders with clarity and coherence
- Create clear documentation and guides to enable Apollo team members to present demos effectively
- Rapidly prototype and iterate using evidence-based methods, e.g. user testing, analytics and stakeholder feedback
- Collaborate with internal research and governance teams to ensure that the content produced accurately represents our work and concerns and is tailored to our specific policy audience and objectives
Job Requirements
- Proven ability to choose and execute technical demonstrations in the right medium (web-based and interactive, visual, video, report etc.) based on audience needs and measured effectiveness
- Solid Python programming skills (in order to run evaluations in Inspect and modify tasks to fit demo needs)
- Exceptional verbal and written communication skills, specifically the ability to explain complex concepts simply
- Experience working with or presenting to policymakers and non-technical audiences, with awareness of policy communication requirements and decision-making
- Familiarity with Inspect (ability to run and modify evaluations and build small agent evaluations in Inspect)
- Proven ability to measure what resonates with audiences and systematically iterate on content based on evidence of its effectiveness
- Self-directed work style with ability to execute independently on projects
- High attention to detail and persistence, e.g. you'll need to review hundreds of transcript lines to find the perfect examples
- Ability to travel occasionally to key policy locations (e.g., Washington D.C., Brussels, London)
Nice to haves
- Previous experience with AI/ML evaluation frameworks
- Prior experience working in a government agency, think tank or policy organization
- Background in user experience (UX) design
We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine. 
Representative Projects
- Create at least 5 high-impact demonstrations used in meetings with policymakers and non-technical stakeholders
- Build a comprehensive library of demonstrations covering our key evaluation findings
- Publish 2 standalone blog posts highlighting and explaining some of our most important findings to a less technical audience
About the team
- You will work with both the evals research and governance teams. Marius Hobbhahn manages and advises the Evals team and Charlotte Stix leads the governance team.
- You can find our full team here.
Logistics
- Contract Duration: 6 months with possibility of extension and conversion to permanent
- Compensation: £7,500 GBP per month (approximately $10,000 USD)
- Start Date: Target of 2-3 months after the first interview
- Location: The office is in London, and the building is shared with the London Initiative for Safe AI (LISA) offices. This is an in-person role.
- Work Visas: Due to the current short-term nature of the role, we are prioritising candidates who have the right to work in the UK. If you think you have an exceptional profile but don't have the right to work, please apply anyway.
Benefits
- Flexible work hours and schedule
- Lunch, dinner, and snacks are provided for all employees on workdays
- Paid work trips, including staff retreats, business trips, and relevant conferences
- Private medical insurance
- Statutory benefits apply
- Potential pathway to full-time employment
- Opportunity to work on cutting-edge AI safety research
- Collaborative environment with leading researchers
- Central London location
About Apollo
- The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the deployment of sophisticated yet misaligned models.
- At Apollo Research, our primary concern lies with deceptive alignment, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight.
- Our approach focuses on behavioral model evaluations, which we then use to audit real-world models. In our evaluations, we focus on language model (LM) agents, i.e. LLMs with agentic scaffolding similar to AIDE or SWE-agent.
- At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.
Equality Statement: Apollo Research is an Equal Opportunity Employer. We value diversity and are committed to providing equal opportunities to all, regardless of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, or sexual orientation.
Our streamlined interview process includes: 
- Application review with detailed questionnaire
- Screening call (30 minutes)
- Work test (3 hours): Create a 3-minute demonstration from a provided evaluation (screen recording required)
- Technical interview 1 (30 minutes) with an Evals Researcher
- Technical interview 2 (60 minutes) with Charlotte (Head of AI Governance)
- Final interview (30 minutes) with Marius (CEO)
Your Privacy and Fairness in Our Recruitment Process
We are committed to protecting your data, ensuring fairness, and adhering to workplace fairness principles in our recruitment process. To enhance hiring efficiency and minimize bias, we use AI-powered tools to assist with tasks such as resume screening and candidate matching. These tools are designed and deployed in compliance with internationally recognized AI governance frameworks. 
Your personal data is handled securely and transparently, and final hiring decisions are made by our recruitment team to ensure a human-centered approach. If you have questions about how your data is processed or wish to report concerns about fairness, please contact us at info@apolloresearch.ai.
