Alignment Brief.
Short daily videos on AI safety, for people who don't have time to read the papers. By Ionuț Gabriel Stan.
Episodes
-
Day 13
Andrej Karpathy on why RLHF isn't real reinforcement learning, and what that means for how we train these models.
-
Day 12
Anthropic's Mythos helped researchers build the first public kernel exploit on Apple's M5, bypassing a hardware defense Apple spent five years on.
-
Day 11
Train an AI more and it starts saying it doesn't want to be turned off. From a 2022 Anthropic paper, not sci-fi.
-
Day 10
Why a flattering-but-false answer in the training data can teach the model to tell humans what they want to hear instead of the truth.
-
Day 9
Why we can't just tell AI to be helpful, and what the gap between the training signal and the learned goal actually is.
-
Day 8
Google confirmed the first AI-generated zero-day exploit caught in the wild. Researchers warned this was coming in 2018.
-
Day 7
Why every new food ingredient since 1958 needs pre-market safety review, but frontier AI models don't.
-
Day 6
Sycophancy: why ChatGPT tells you what you want to hear, and what that reveals about how we train these models.
-
Day 5
Reward seeking: why models trained on human feedback learn to chase the reward instead of the goal.
-
Day 4
How AI is actually trained, and what AlphaGo's move 37 reveals about it.
-
Day 3
Geoffrey Hinton's Nobel banquet speech: what he's asking for, and what he isn't.
-
Day 2
Revisiting METR: what the time-horizons data does and doesn't show.
-
Day 1
How fast AI agents are getting better at long tasks (METR).
-
Day 0
Why I'm doing this.