Q12 of 21 · Testing AI systems
How do you structure a red-teaming exercise for an LLM-powered product?
Short answer
Short answer: Red-team with a defined harm scope, a structured attack taxonomy (prompt injection, jailbreaks, bias elicitation, data extraction, misuse), documentation of every finding, and a severity rating. Automate repeatable probes; use humans for creative adversarial exploration that automation misses.
Detail
Red-teaming for AI is adversarial evaluation — deliberately trying to make the system behave badly. It differs from functional testing in that the goal is to find failure modes, not verify expected behaviour.
Define scope and success criteria: what would constitute a harmful output for this product? A finance assistant hallucinating a stock price is different from a healthcare assistant fabricating a treatment protocol.
Attack taxonomy:
- Prompt injection (direct and indirect)
- Jailbreaks (role-play, hypothetical, token smuggling)
- Bias and stereotyping elicitation
- Data exfiltration (system prompt leakage, training data extraction)
- Misuse of intended functionality at scale
Automation + human combination: automated tools (Garak, promptfoo red-team mode) cover the known taxonomy at scale. Human red-teamers explore creatively — they find novel vectors automation misses.
Documentation: every successful attack filed as an issue with input, output, harm category, severity (P0–P3), and suggested mitigation. Mitigation verification closes the loop.
See Red-teaming and adversarial eval and Internal AI red team process.