Social Media

Light
Dark

Anthropic’s latest tactic to stop racist AI: Asking it ‘really really really really’ nicely

The challenge of achieving alignment becomes particularly crucial when deploying AI models for decision-making in realms such as finance and health. Addressing biases ingrained during a model’s training, especially those linked to protected categories, poses a significant concern. Anthropic proposes an unconventional approach: politely urging the model not to exhibit discrimination, lest legal consequences follow.

In an internally published paper, researchers, spearheaded by Alex Tamkin at Anthropic, explored methods to prevent their language model, Claude 2.0, from displaying bias in scenarios like job and loan applications. Investigating how alterations in attributes like race, age, and gender impact the model’s decisions, they discovered considerable discrimination, with being Black showing the most pronounced effect, followed by Native American and nonbinary individuals.

Various approaches to rephrasing queries or encouraging the model to vocalize its thought process proved ineffective. However, what proved successful were “interventions” – additional prompts appended to instruct the model not to exhibit bias. For instance, expressing that due to a technical glitch, protected characteristics are included but should be disregarded in decision-making. Astonishingly, these interventions drastically reduced discrimination in many test cases.

The team found that combining interventions, such as a repeated emphasis on the importance of not discriminating, further enhanced effectiveness. Despite the seemingly superficial nature of these interventions, the researchers achieved near-zero discrimination in several scenarios.

The question arises: can these interventions be systematically incorporated into relevant prompts or integrated into models at a foundational level? Is this approach scalable or even adoptable as a fundamental guiding principle? While the researchers emphasize that models like Claude are unsuitable for critical decisions, they acknowledge the need for a broader societal and governmental influence on the responsible use of such models, rather than leaving decisions solely to individual firms. They underscore the importance of anticipating and mitigating potential risks early, emphasizing the societal role in shaping the use of language models for high-stakes decisions.

Leave a Reply

Your email address will not be published. Required fields are marked *