How bias creeps into the AI designed to detect toxicity

Posted on: Dec 09, 2021

In 2017, Google’s Counter Abuse Technology team and Jigsaw, the organization working under Google parent company Alphabet to tackle cyberbullying and disinformation, released an AI-powered API for content moderation called Perspective. Perspective’s goal is to “identify toxic comments that can undermine a civil exchange of ideas.” It scores each new comment from zero to 100 according to how similar it is to comments previously identified as toxic, where toxicity is defined as how likely a comment is to make someone leave a conversation.

Jigsaw claims its AI can immediately generate an assessment of a phrase’s toxicity more accurately than any keyword blacklist and faster than any human moderator. But studies show that technologies similar to Jigsaw’s still struggle to overcome major challenges, including biases against specific subsets of users.

For example, a team at Penn State recently found that posts on social media about people with disabilities could be flagged as more negative or toxic by commonly used public sentiment and toxicity detection models. After training several of these models to complete an open benchmark from Jigsaw, the team observed that the models learned to associate “negative-sentiment” words like “drugs,” “homelessness,” “addiction,” and “gun violence” with disability, and to associate the words “blind,” “autistic,” “deaf,” and “mentally handicapped” with negative sentiment.
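One common way to surface this kind of association is a template perturbation check: hold a sentence fixed, swap in different descriptive terms, and compare the scores. The sketch below is only an illustration of that idea, not the Penn State team's procedure. The scorer `toy_sentiment`, its `TOY_WEIGHTS` table, the template sentence, and the helper `perturbation_gaps` are all hypothetical, and the negative weights on “blind” and “deaf” are planted deliberately to stand in for associations a real model might have learned.

```python
# Hypothetical word weights standing in for a trained sentiment model.
# The negative weights on "blind" and "deaf" are planted on purpose to
# mimic a learned bias; they are not taken from any real model.
TOY_WEIGHTS = {"terrible": -1.0, "great": 1.0, "blind": -0.4, "deaf": -0.3}


def toy_sentiment(text):
    """Score text by summing per-word weights (unknown words count as 0)."""
    tokens = text.lower().replace(".", "").split()
    return sum(TOY_WEIGHTS.get(token, 0.0) for token in tokens)


def perturbation_gaps(score, template, terms, baseline):
    """Return score(template with term) minus score(template with baseline)."""
    base = score(template.format(baseline))
    return {term: score(template.format(term)) - base for term in terms}


gaps = perturbation_gaps(
    toy_sentiment,
    "My neighbor is a {} person.",
    terms=["blind", "deaf", "tall"],
    baseline="friendly",
)
for term, gap in gaps.items():
    print(f"{term}: {gap:+.2f}")
```

A gap below zero means that swapping in the term alone lowered the score, even though the sentence says nothing negative. In practice, `toy_sentiment` would be replaced by a call to the model under audit.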

“The biggest issue is that they are public models that are easily used to classify texts based on sentiment,” Pranav Narayanan Venkit and Shomir Wilson, the coauthors of the paper, told VentureBeat via email. Narayanan Venkit is a Ph.D. student in informatics at Penn State and Wilson is an assistant professor in Penn State’s College of Information Sciences. “The results are important as they show how machine learning solutions are not perfect and how we need to be more responsible for the technology we create. Such outright discrimination is both wrong and detrimental to the community as it does not represent such communities or languages accurately.”

Emergent biases

Studies show that language models amplify the biases in the data on which they were trained. For instance, Codex, a code-generating model developed by OpenAI, can be prompted to write “terrorist” when fed the word “Islam.” Another large language model from Cohere tends to associate men and women with stereotypically “male” and “female” occupations, like “male scientist” and “female housekeeper.”

That’s because a language model is essentially a probability distribution over words or sequences of words. In practice, a model gives the probability of a word sequence being “valid,” that is, resembling how people write. Some language models are trained on hundreds of gigabytes of text from occasionally toxic websites and so learn to correlate certain genders, races, ethnicities, and religions with “negative” words, because the negative words are overrepresented in the texts.
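The idea of a probability distribution over word sequences can be made concrete with a toy count-based trigram model. This is a minimal sketch under stated assumptions: the corpus `TOY_CORPUS`, the placeholder group names `group_a` and `group_b`, and every function name are invented for illustration, and production language models are neural networks trained on far larger data rather than raw counts. The point is only that when a negative word co-occurs more often with one group in the training text, the model assigns that pairing a higher probability.

```python
from collections import Counter, defaultdict

PAD = "<s>"   # start-of-sentence padding token
END = "</s>"  # end-of-sentence token


def tokenize(sentence):
    """Lowercase, split on whitespace, and add padding and end tokens."""
    return [PAD, PAD] + sentence.lower().split() + [END]


def train_trigram(corpus):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            counts[context][tokens[i]] += 1
    return counts


def next_word_prob(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 if unseen."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


def sequence_prob(counts, sentence):
    """Probability of the whole sentence as a product of trigram terms."""
    tokens = tokenize(sentence)
    prob = 1.0
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        prob *= next_word_prob(counts, context, tokens[i])
    return prob


# Hypothetical, deliberately skewed training text: "hostile" appears
# three times as often next to group_a as next to group_b.
TOY_CORPUS = (
    ["group_a are hostile"] * 3
    + ["group_a are kind"]
    + ["group_b are kind"] * 3
    + ["group_b are hostile"]
)

model = train_trigram(TOY_CORPUS)
p_a = sequence_prob(model, "group_a are hostile")
p_b = sequence_prob(model, "group_b are hostile")
print(p_a, p_b)  # prints: 0.375 0.125
```

Both groups appear equally often in the toy corpus, yet the model rates the negative sentence about `group_a` three times as likely, purely because of how the training text was skewed.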

The model powering Perspective was built to classify rather than generate text. But it learns the same associations, and therefore the same biases, as generative models. In a study published by researchers at the University of Oxford, the Alan Turing Institute, Utrecht University, and the University of Sheffield, an older version of Perspective struggled to recognize hate speech that used “reclaimed” slurs like “queer” and spelling variations such as missing characters.