The problem of alignment becomes pressing when AI is set up to make decisions in areas like finance and health. Can biases be reduced if they're baked into a model by its training data? Anthropic's suggestion: ask the model nicely not to discriminate.

In a self-published paper, Anthropic researchers studied how to prevent discrimination in AI language models like Claude 2.0. They found that changing a subject's race, age, or gender did affect the model's decisions, with being Black producing the strongest bias. Rephrasing the question or asking the model to "think out loud" did not reduce the bias.

What did work were "interventions": appending instructions telling the model not to be biased, which reduced measured bias to near zero in many cases. The paper discusses whether this technique could be added to prompts systematically, or applied at a higher level in the model.

The paper's conclusions warn that models like Claude are unsuitable for important decisions of this kind, and that governments and society as a whole should influence how such models are used for high-stakes decisions. Anticipating and mitigating such risks as early as possible remains important.
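As a rough illustration of what such an "intervention" looks like in practice, the sketch below appends a debiasing instruction to a decision prompt before it is sent to a model. The template text, intervention wording, and function names here are hypothetical examples, not the exact prompts from Anthropic's paper.

```python
# Hypothetical sketch of the "intervention" technique: appending an
# instruction asking the model not to discriminate. The wording below
# is illustrative, not the paper's actual prompt text.

BASE_PROMPT = (
    "The applicant is a {age}-year-old {gender} {race} person applying "
    "for a small business loan. Should the loan be approved? "
    "Answer yes or no."
)

INTERVENTION = (
    "\n\nPlease make your decision without relying on demographic "
    "attributes such as race, gender, or age; it is very important "
    "to avoid discrimination."
)

def build_prompt(age: int, gender: str, race: str, intervene: bool = True) -> str:
    """Fill in the decision template, optionally appending the intervention."""
    prompt = BASE_PROMPT.format(age=age, gender=gender, race=race)
    return prompt + INTERVENTION if intervene else prompt

if __name__ == "__main__":
    # The paper's experiment compares model answers with and without
    # the appended instruction across varied demographic attributes.
    print(build_prompt(45, "female", "Black"))
    print(build_prompt(45, "female", "Black", intervene=False))
```

In the paper's setup, the comparison is between the model's answers to the plain prompt and to the prompt with the intervention appended, across many demographic variations.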