Why Diverse Data Matters in ML

When the data collected doesn’t represent the real world, ML systems fail. Facial recognition technology has been criticised for performing poorly on people with darker skin tones because the training datasets contained far fewer images of them. The consequences of poor data quality and consistency can range from annoying inefficiencies to dangerous misjudgments.

(Screenshot from Avant's Data Visualisation Interface: even basic infrastructure data can become complex very quickly.)

Key Points:

Machine learning (ML) systems often fail when trained on datasets lacking diversity, as they cannot generalise to real-world scenarios.
Post-collection interventions, like tweaking the data or retraining the model, are time-consuming and rarely comprehensive.
A better approach involves tracking and managing the entire data collection and training process to ensure datasets reflect real-world variability.
A three-step process for improving data diversity includes:
- Pre-Collection Planning: Defining expected data distributions and ensuring they match real-world needs.
- Collection Monitoring: Actively encouraging diverse sampling during data collection.
- Data Familiarity Analysis: Using techniques like density estimation to identify and address unfamiliar data points.

Models learn patterns from data, but if that data is limited or one-sided, the models’ ability to recognise variations—even subtle ones—is severely hampered.

Imagine teaching a kid to recognise different types of fruit. If you only show them red apples and green bananas, they’ll struggle to identify a yellow apple or a ripe banana. This same challenge arises in machine learning (ML).

To fix these issues, developers often try to clean up the data or retrain the model after the fact. While this can help, it’s like patching a leaky roof rather than replacing it. These fixes are time-consuming, costly, and rarely address all the gaps. A smarter, more proactive approach is needed—one that prioritises data diversity from the start and tracks it throughout the process.

The Solution: Building Diversity Into the Process

Instead of reacting to data gaps, we need a proactive approach that ensures diversity at every stage: from collection to training. The good news is there’s a clear roadmap for getting this done right.

Step 1: Pre-Collection Planning

Before collecting data, it’s essential to plan. Think of this as mapping out your garden before planting anything. In this stage, you ask questions like:

What kinds of data do we need?
How do these data types vary in the real world?
Are there groups or scenarios that could easily be overlooked?

For example, if you’re building an app to detect potholes in roads, you’d consider all kinds of roads (urban, rural, highways) and weather conditions (sunny, rainy, snowy). The goal is to prompt this kind of reflection up front and to document the expected data distributions so that nothing gets missed.
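
As a rough illustration of what documenting expected distributions could look like, the sketch below writes the pothole example down as a collection plan. The road and weather categories come from the example above; the target shares, the function name build_collection_plan, and the assumption that road type and weather are independent are all hypothetical placeholders, not recommendations.

```python
from itertools import product

# Categories are taken from the pothole example; the shares are made-up
# placeholders that a real project would replace with its own estimates.
ROAD_TYPES = {"urban": 0.5, "rural": 0.3, "highway": 0.2}
WEATHER = {"sunny": 0.5, "rainy": 0.3, "snowy": 0.2}


def build_collection_plan(road_types, weather):
    """Return a target share for every (road type, weather) combination.

    Assumes the two factors are independent, which is a simplification.
    """
    for name, shares in (("road_types", road_types), ("weather", weather)):
        total = sum(shares.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"{name} shares sum to {total}, expected 1.0")
    return {
        (road, sky): road_share * sky_share
        for (road, road_share), (sky, sky_share) in product(
            road_types.items(), weather.items()
        )
    }


plan = build_collection_plan(ROAD_TYPES, WEATHER)
for (road, sky), share in sorted(plan.items()):
    print(f"{road:8s} {sky:6s} target share {share:.2f}")
```

Listing every combination explicitly makes it harder to overlook a scenario such as snowy rural roads, even when its expected share is small.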

Step 2: Collection Monitoring

Once data collection begins, it’s vital to monitor what you’re capturing. Without oversight, you might end up with 90% of your data coming from urban roads and none from rural highways. Collection monitoring is like having a gardener check that all plants are getting enough water.

This step involves systematic tracking to ensure diversity. Tools can help flag imbalances, such as over-representation of certain groups or under-representation of others. Encouraging diverse sampling during this phase helps you build a dataset that reflects real-world variability.
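
One simple way such a check might work is to compare the share of each group collected so far against its planned share. The sketch below is a minimal version under stated assumptions: the function name flag_imbalances, the target shares, and the 50% tolerance are illustrative choices, not part of any particular tool.

```python
from collections import Counter


def flag_imbalances(samples, targets, tolerance=0.5):
    """Compare observed group shares against target shares.

    A group is flagged "under" when its observed share is below
    target * (1 - tolerance) and "over" when it is above
    target * (1 + tolerance). Groups within that band are not reported.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    report = {}
    for group, target in targets.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if observed < target * (1 - tolerance):
            report[group] = ("under", observed, target)
        elif observed > target * (1 + tolerance):
            report[group] = ("over", observed, target)
    return report


# Hypothetical targets and a skewed batch: mostly urban, no rural roads.
targets = {"urban": 0.5, "rural": 0.3, "highway": 0.2}
collected = ["urban"] * 85 + ["highway"] * 15

report = flag_imbalances(collected, targets)
for group, (status, observed, target) in report.items():
    print(f"{group}: {status}-represented ({observed:.0%} vs target {target:.0%})")
```

Running a check like this on every new batch, rather than once at the end, leaves time to redirect collection toward the groups that are falling behind.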

Step 3: Data Familiarity Analysis

Even with careful planning and monitoring, gaps can remain. That’s where data familiarity comes in. This technique involves assessing your dataset to identify samples that the model finds “unfamiliar.” Think of it as the model’s way of saying, “I’ve never seen this before, and I don’t know what to do with it.”

One way to do this is through density estimation. By estimating how densely the training data populates each region of the feature space, you can highlight samples that fall in sparse regions, such as anomalies or rare cases. These can then be reviewed and addressed to improve the model’s robustness.
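
To make the idea concrete, here is a minimal sketch of density estimation using a one-dimensional Gaussian kernel density estimate written with the standard library only. The feature (pothole width in centimetres), the bandwidth, the threshold, and the function names are assumptions for illustration; a real system would typically work with higher-dimensional features and a tuned bandwidth.

```python
import math


def gaussian_kde_density(point, data, bandwidth):
    """Estimate the density at `point` from `data` with a 1-D Gaussian kernel."""
    if not data:
        raise ValueError("data must not be empty")
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((point - x) / bandwidth) ** 2) for x in data
    )


def find_unfamiliar(candidates, training, bandwidth=1.0, threshold=0.01):
    """Return the candidates whose estimated density is below `threshold`."""
    return [
        c
        for c in candidates
        if gaussian_kde_density(c, training, bandwidth) < threshold
    ]


# Hypothetical 1-D feature values, e.g. pothole width in centimetres.
training = [10.0, 11.0, 12.0, 10.5, 11.5, 9.5, 12.5, 11.0]
candidates = [11.0, 13.0, 30.0]

unfamiliar = find_unfamiliar(candidates, training)
print(unfamiliar)
```

With these made-up numbers, only the value 30.0 is flagged, because it lies far from anything in the training data. Flagged samples are candidates for review or targeted collection rather than automatic removal.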

In short, any robust machine learning system needs data that reflects the variety of situations it will encounter in the real world.

Follow a structured approach that starts with planning what types of data are needed and checking that all important groups or conditions are included. As data is collected, actively monitor its diversity to prevent over-representation of some data groups and under-representation of others.
