Safety, red-team, robustness papers

TruthfulQA, BBQ, Constitutional AI, GCG, sycophancy, HarmBench, sandbagging, alignment faking.

Eleven papers covering safety, robustness, and red-teaming. TruthfulQA, BBQ, and Constitutional AI set the agenda: truthfulness, bias, and an alignment recipe that scales without 100K human labels. GCG, Sclar et al. on prompt sensitivity, HarmBench, and the sycophancy paper showed how easily things break when you push. The 2024 to 2025 results on reward hacking, sandbagging, and alignment faking are the ones to read before claiming a model is safe.