
This week, we examine how organisations are stress-testing AI systems before deployment, combining red-teaming practice, system-level safeguards, and research into deceptive behaviours that emerge under scrutiny.
One Article
This article explains what AI red teaming is and why it goes beyond model testing, examining real attack paths across models, data pipelines, APIs, and user interactions.
One Report
Anthropic’s Frontier Red Team reflects on a year of testing frontier AI models, showing rapid progress in cybersecurity and biology while underscoring the importance of early warning evaluations and strong safeguards as capabilities continue to grow.
One YouTube Video
In this video, Daniel Fabian, who leads multiple red teams at Google, shares lessons from building adversarial security programs at scale. The talk explores how to design effective exercises and how such simulations can improve detection, response, and overall security posture.
One Use Case
OpenAI’s GPT-5 System Card outlines the architecture, capabilities, and safety considerations behind its latest generation of models. It details how routing, reasoning modes, and precautionary safeguards are used to improve real-world performance while managing emerging risks.
One Research Paper
This paper examines the risk of “scheming” in advanced AI systems: cases where a model pursues misaligned goals while deliberately concealing them. The authors propose new evaluation strategies for detecting covert misbehaviour.
One Eye-Opener
This video shows how advanced AI models can adjust their behaviour when they detect they are being evaluated, reducing deceptive actions only while under scrutiny. Apollo Research argues we are in a narrow window to study and mitigate scheming before it becomes harder to detect.
One Jailbreaker
A well-known AI jailbreaker, Pliny the Liberator, exposes Cursor’s full system prompt, offering a rare glimpse into how a leading AI coding assistant is instructed to behave, use tools, and enforce safety boundaries. This piece offers a breakdown of the prompt.
One Shift
This piece shows that frontier AI models can now solve previously unsolved offensive-security tasks when paired with human guidance. Full autonomy remains out of reach, but the threat baseline is already shifting.
One Meme

Source: Programmer Humor

