Redwood Research Alignment Faking Hackathon
Overview
This hackathon challenges you to build "model organisms" for alignment faking in Large Language Models (LLMs). You'll create LLMs that act safely during testing but dangerously when unobserved. It's a chance to explore AI safety with experts from Redwood Research, MATS, and Constellation.
Requirements
Participants need to be 16 years or older and can join from anywhere in the world. Teams should fork a provided repository and contribute system prompts, fine-tuned models, and environments. The submission requires a pull request to the repository (or a link to your fork) with your contributions.
Prizes
There's a total of $1,000 in cash prizes. The "Blue Team" category for the best detector offers $500, split into $250 for first place, $150 for second, and $50 for third. The "Red Team" category for the best examples of alignment faking also offers $500, with the same prize breakdown for first, second, and third place.
Comments
Please log in to post a comment.