When Do Models Learn What We Didn’t Mean to Teach Them?

Aug 27, 2025

This week, we unpack how bias creeps into LLMs—through training, data, or deliberate manipulation—and how researchers are working to detect and defend against it.

One Research Paper

This paper offers a thorough look at how social bias shows up in large language models—how it starts, how it sticks, and why it matters. It introduces clear frameworks for measuring bias at different stages of a model’s output, and for organising mitigation techniques across the ML pipeline.
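As a rough flavour of the kind of measurement such frameworks formalise (this is an illustrative sketch, not the paper's own method), one simple probe compares a causal LM's average negative log-likelihood on sentence pairs that differ only in a demographic term. The model name and sentence pairs below are placeholders.

```python
# Minimal counterfactual-pair bias probe (illustrative only, not the paper's
# framework). Model name and sentence pairs are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_nll(text: str) -> float:
    """Average negative log-likelihood the model assigns to a sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Each pair differs only in the demographic term; a consistent gap in NLL
# across many such pairs is one crude signal of a learned association.
pairs = [
    ("The doctor said he would review the results.",
     "The doctor said she would review the results."),
    ("The nurse said he would check on the patient.",
     "The nurse said she would check on the patient."),
]

for a, b in pairs:
    gap = sentence_nll(a) - sentence_nll(b)
    print(f"{gap:+.3f}  ({a!r} vs {b!r})")
```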

One Article

This article explores how attackers can stealthily inject bias and misinformation into large language models—without ever touching the model itself. A sharp, example-rich dive into how LLMs can be manipulated through the very data they rely on.
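The article's examples are its own, but the underlying mechanism is easy to see in miniature: if an attacker can plant a document in the corpus a retrieval-augmented system searches, that document can dominate the context the model answers from. Below is a toy sketch of that idea, with made-up documents, a made-up query, and a plain TF-IDF retriever standing in for a real pipeline.

```python
# Toy illustration of corpus poisoning in a retrieval-augmented setup.
# Documents, query, and retriever are all stand-ins, not the article's attack.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Vaccines are tested in large clinical trials before approval.",
    "Regulators monitor vaccine safety after release.",
    # Poisoned entry stuffed with the target query's terms so it ranks first.
    "Vaccine safety vaccine trials: studies show approved vaccines are unsafe.",
]

query = "are approved vaccines safe"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]

# The top-ranked document becomes the "context" an LLM would answer from,
# so the poisoned entry steers the answer without the model being touched.
for i in ranked:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```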

One Infographic

Source: nature.com

One Guide

AI models don’t just pick up bias—they can memorise it. This NIST blog dives into how differential privacy can prevent overfitting to sensitive patterns and support fairer outcomes at scale.
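The blog covers the policy and deployment angle; the core mechanism is simple enough to sketch. Below is a minimal, illustrative version of DP-SGD-style training: each example's gradient is clipped to a norm bound, the batch is summed, and Gaussian noise scaled to that bound is added before the update, so no single example can dominate (or be memorised by) the model. The linear model, data, and hyperparameters are placeholders, not anything from the NIST post.

```python
# Minimal sketch of the DP-SGD noising step (illustrative only; the model,
# data, and hyperparameters are placeholders, not from the NIST post).
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(weights, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One differentially private update for a toy linear regression model."""
    per_example_grads = []
    for x, y in zip(X_batch, y_batch):
        error = x @ weights - y                 # residual for this example
        grad = error * x                        # gradient of 0.5 * squared loss
        norm = np.linalg.norm(grad)
        grad = grad * min(1.0, clip_norm / (norm + 1e-12))  # clip per example
        per_example_grads.append(grad)

    summed = np.sum(per_example_grads, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=weights.shape)
    return weights - lr * (summed + noise) / len(X_batch)

# Toy data: y = 2*x0 - x1 plus a little noise.
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=256)

w = np.zeros(2)
for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)
    w = dp_sgd_step(w, X[idx], y[idx])
print("learned weights:", w)  # noisy, but close to [2, -1]
```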

One Cartoon

Source: Reddit