4 Sources of Machine Learning Bias & How to Mitigate the Impact on AI Systems

Posted on: Aug 20, 2018

Artificial intelligence (AI) isn't perfect. An AI system is a combination of algorithms and data, and bias can occur in either element.

When we produce AI training data, we know to look for biases that can influence machine learning (ML). In our experience, there are four distinct types of bias that data scientists and AI developers should vigilantly guard against.

Algorithm Bias

Bias in this context has nothing to do with data. It’s actually a mathematical property of the algorithm that is acting on the data. Managing this kind of bias and its counterpart, variance, is a core data science skill.

Algorithms with high bias tend to be rigid. As a result, they can miss underlying complexities in the data they consume. However, they are also more resistant to noise in the data, which can distract lower-bias algorithms.

By contrast, algorithms with high variance can accommodate more data complexity, but they are also more sensitive to noise and less able to generalize confidently to data outside the training set.

Data scientists are trained in techniques that produce an optimal balance between algorithmic bias and variance. It’s a balance that has to be revisited over and over, as models encounter more data and are found to predict with more or less confidence.
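The tradeoff above can be illustrated with a minimal numpy sketch (not from the article): fitting polynomials of increasing degree to noisy samples of a sine curve. The degree-1 fit stands in for a high-bias model, the degree-15 fit for a high-variance one; the data and degrees are illustrative choices, not anything the article specifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training samples from a smooth underlying function (a sine wave),
# plus a clean held-out test grid.
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_error(degree):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1: high bias -- too rigid to capture the sine wave.
# Degree 15: high variance -- flexible enough to bend toward the noise.
for degree in (1, 4, 15):
    train_mse, test_mse = fit_error(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error falls monotonically as the degree grows, but the rigid degree-1 model also tests poorly: it misses the structure entirely, which is the "missed complexity" failure described above.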

Sample Bias

Sample bias occurs when the data used to train the algorithm does not accurately represent the problem space the model will operate in.

For example, if an autonomous vehicle is expected to operate in the daytime and at night, but is trained only on daytime data, its training data has sample bias. The model driving the vehicle is highly unlikely to learn how to operate at night with incomplete and unrepresentative training data.

Data scientists use a variety of techniques to:

  • Select samples from populations and validate their representativeness
  • Identify population characteristics that need to be captured in samples
  • Analyze a sample’s fit with the population
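One simple instance of the "analyze a sample's fit" step is comparing category proportions between the sample and the population. The sketch below (an illustration, not the article's method) uses total-variation distance over hypothetical day/night driving-footage labels, echoing the autonomous-vehicle example.

```python
from collections import Counter

def representativeness_gap(population, sample):
    """Total-variation distance between category proportions in a
    population and in a sample (0 = identical mix, 1 = fully disjoint)."""
    pop, samp = Counter(population), Counter(sample)
    n_pop, n_samp = len(population), len(sample)
    categories = set(pop) | set(samp)
    return 0.5 * sum(abs(pop[c] / n_pop - samp[c] / n_samp)
                     for c in categories)

# Hypothetical labels: the operating environment mixes day and night
# scenes, but the biased training sample contains daytime footage only.
population = ["day"] * 700 + ["night"] * 300
biased_sample = ["day"] * 100
balanced_sample = ["day"] * 70 + ["night"] * 30

print(representativeness_gap(population, biased_sample))    # 0.3
print(representativeness_gap(population, balanced_sample))  # 0.0
```

A nonzero gap flags sample bias before training; real pipelines would typically extend this to many characteristics at once, per the second bullet above.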

Prejudicial Bias

Prejudicial bias tends to dominate the headlines around AI failures, because it often touches on cultural and political issues. It occurs when training data content is influenced by stereotypes or prejudice within the population. Data scientists and organizations need to make sure the algorithm doesn't learn and manifest outputs that echo those stereotypes or prejudices.
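One common way to check whether a model's outputs echo prejudice in the data is a demographic-parity audit: compare positive-outcome rates across groups. The article does not prescribe this metric; the sketch below, with made-up decisions and group labels, is just one illustrative check.

```python
def demographic_parity_gap(outcomes, groups):
    """Spread between the highest and lowest positive-outcome rate across
    groups; 0 means every group receives positive outcomes at the same rate."""
    rates = {}
    for g in set(groups):
        group_outcomes = [o for o, gr in zip(outcomes, groups) if gr == g]
        rates[g] = sum(group_outcomes) / len(group_outcomes)
    return max(rates.values()) - min(rates.values())

# Hypothetical screening decisions (1 = approved) for two groups.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups   = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

print(demographic_parity_gap(outcomes, groups))  # 0.6 (80% vs. 20% approval)
```

A large gap does not by itself prove prejudicial bias, but it is a cheap signal that the training data, or the model trained on it, deserves scrutiny.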