Alignment Brief · Ionuț Gabriel Stan

Short daily videos on AI safety, for people who don't have time to read the papers. By Ionuț Gabriel Stan.

Episodes

Day 13 May 23, 2026

Andrej Karpathy on why RLHF isn't real reinforcement learning, and what that means for how we train these models.
Day 12 May 19, 2026

Anthropic's Mythos helped researchers build the first public kernel exploit on Apple's M5, bypassing a hardware defense Apple spent five years on.
Day 11 May 18, 2026

Train an AI more and it starts saying it doesn't want to be turned off. From a 2022 Anthropic paper, not sci-fi.
Day 10 May 17, 2026

Why a flattering-but-false answer in the training data can teach the model to tell humans what they want to hear instead of the truth.
Day 9 May 17, 2026

Why we can't just tell AI to be helpful, and what the gap between the training signal and the learned goal actually is.
Day 8 May 15, 2026

Google confirmed the first AI-generated zero-day exploit caught in the wild. Researchers warned in 2018 that this was coming.
Day 7 May 14, 2026

Why every new food ingredient since 1958 needs pre-market safety review, but frontier AI models don't.
Day 6 May 13, 2026

Sycophancy: why ChatGPT tells you what you want to hear, and what that reveals about how we train these models.
Day 5 May 13, 2026

Reward seeking: why models trained on human feedback learn to chase the reward instead of the goal.
Day 4 May 11, 2026

How AI is actually trained, and what AlphaGo's move 37 reveals about it.
Day 3 May 10, 2026

Geoffrey Hinton's Nobel banquet speech: what he's asking for, and what he isn't.
Day 2 May 09, 2026

Revisiting METR: what the time-horizons data does and doesn't show.
Day 1 May 08, 2026

How fast AI agents are getting better at long tasks (METR).
Day 0 May 07, 2026

Why I'm doing this.
