Reward Mismatches in RL Cause Emergent Misalignment

Learning to do misaligned-coded things anywhere teaches an AI (or a human) to do misaligned-coded things everywhere. So be sure you never, ever teach…