FAIR Principles for Data Management
Ensuring data is Findable, Accessible, Interoperable, and Reusable is essential for effective data management, especially in the age of big data and AI. This repository shows our approach to adopting the FAIR principles, with an emphasis on specific data formats and methodologies for training classifiers.
Table of Contents
- Introduction
- Binary and Multi-label Classification
- CONLL BIO FORMAT
- Debiasing Training Data
- Active Learning and Labeling
Introduction
The FAIR principles guide our data management strategy, ensuring our datasets are:
- Findable: Easily located and identified.
- Accessible: Available for retrieval without unnecessary barriers.
- Interoperable: Compatible with other data systems and platforms.
- Reusable: Ready for future applications and re-analysis.
Binary and Multi-label Classification
To train the UnBIAS classifier, data should be labeled either as "biased" or "unbiased". Multi-label classifications are also supported.
Example:
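The record layout below is a minimal sketch of labeled data for both cases. The field names `text`, `label`, and `labels`, the multi-label tag names, and the helper functions are assumptions made for illustration; they are not taken from the UnBIAS package.

```python
# Hypothetical record layout; field names and tag names are illustrative assumptions.
binary_examples = [
    {"text": "Men are better leaders.", "label": "biased"},
    {"text": "Leadership ability isn't gender-specific.", "label": "unbiased"},
]

# Multi-label case: each text may carry zero or more tags.
multilabel_examples = [
    {"text": "Women are just too emotional to be leaders.", "labels": ["gender", "stereotype"]},
    {"text": "The meeting starts at noon.", "labels": []},
]

ALLOWED_BINARY_LABELS = {"biased", "unbiased"}


def validate_binary(records):
    """Return the records whose label is not one of the two allowed values."""
    return [r for r in records if r.get("label") not in ALLOWED_BINARY_LABELS]


def to_multi_hot(labels, label_names):
    """Encode a list of tags as a 0/1 vector ordered by label_names."""
    present = set(labels)
    return [1 if name in present else 0 for name in label_names]


invalid = validate_binary(binary_examples)
vector = to_multi_hot(multilabel_examples[0]["labels"], ["age", "gender", "stereotype"])
```

Checking labels before training catches typos such as "Biased" or "un-biased" early, and the multi-hot vector is one common way to feed several tags per text to a classifier.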
CONLL BIO FORMAT
To train the named entity recognition model, data should be in the CoNLL BIO format.
"B-" means the beginning of an entity.
"I-" means inside or continuation of an entity.
"O" means outside of any entity.
Example:
Sentence: "He's surprisingly articulate for a young man from that neighborhood."
BIO Format:
He's O
surprisingly O
articulate O
for B-BIAS
a I-BIAS
young I-BIAS
man I-BIAS
from I-BIAS
that I-BIAS
neighborhood I-BIAS
. O
O: Represents "Outside". It is used for words that are not part of the entity being tagged.
B-BIAS: Represents "Begin-BIAS". It marks the first word of a bias entity; the "B-" prefix works the same way for any other entity type you track.
I-BIAS: Represents "Inside-BIAS". It marks a word that continues a bias entity started by a "B-BIAS" tag.
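To make the tagging scheme concrete, the sketch below reads "token tag" lines like the ones above and groups them into entity spans. The function names `parse_bio_lines` and `bio_to_spans` are hypothetical helpers written for this illustration, not part of UnBIAS, and the handling of a stray "I-" tag (treated as the start of a new entity) is one possible convention.

```python
def parse_bio_lines(lines):
    """Split 'token tag' lines into parallel token and tag lists."""
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        token, tag = line.rsplit(None, 1)  # the tag is the last whitespace-separated field
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, start, end, text) spans; end is exclusive."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and current_type == entity_type
        if not continues:
            # Close the open entity, if any, before handling the current tag.
            if current_type is not None:
                spans.append((current_type, start, i, " ".join(tokens[start:i])))
                start, current_type = None, None
            # Assumption: an I- tag with no matching open entity starts a new one.
            if prefix in ("B", "I"):
                start, current_type = i, entity_type
    if current_type is not None:
        spans.append((current_type, start, len(tokens), " ".join(tokens[start:])))
    return spans


example_lines = [
    "He's O", "surprisingly O", "articulate O", "for B-BIAS", "a I-BIAS",
    "young I-BIAS", "man I-BIAS", "from I-BIAS", "that I-BIAS",
    "neighborhood I-BIAS", ". O",
]
tokens, tags = parse_bio_lines(example_lines)
spans = bio_to_spans(tokens, tags)
```

For the example sentence, `spans` holds a single BIAS entity covering tokens 3 through 9 with the text "for a young man from that neighborhood".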
Debiasing Training Data
For the debiaser in UnBIAS, the dataset should contain both the original "biased_text" and its "debiased_version".
Example:
Biased Text: "Men are better leaders."
Debiased Version: "Leadership ability isn't gender-specific."
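One way to store such pairs is a two-column CSV file. The sketch below assumes the column names match the field names quoted above ("biased_text" and "debiased_version"); the helper functions and the choice of CSV are illustrative assumptions, since the file format UnBIAS expects is not specified here.

```python
import csv
import io

FIELDNAMES = ["biased_text", "debiased_version"]

pairs = [
    {
        "biased_text": "Men are better leaders.",
        "debiased_version": "Leadership ability isn't gender-specific.",
    },
]


def pairs_to_csv(rows):
    """Serialize biased/debiased pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_pairs(csv_text):
    """Parse CSV text, require both columns, and drop rows with an empty field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(FIELDNAMES) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [row for row in reader if row["biased_text"] and row["debiased_version"]]


csv_text = pairs_to_csv(pairs)
loaded = load_pairs(csv_text)
```

Validating the columns on load keeps incomplete pairs out of the training set, where a missing debiased version would otherwise go unnoticed.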
Active Learning and Labeling
Active learning is a crucial component of our data preparation methodology for UnBIAS. Here's our approach:
- Iterative Labeling: Begin by labeling a small subset of the data.
- Model Training: Train a preliminary model using this subset.
- Uncertainty Sampling: The model flags the data points it is least certain about.
- Human Intervention: Experts label these uncertain points.
- Model Refinement: The model is retrained with the new labeled data.
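
The steps above can be sketched as a short loop. This is a minimal illustration using least-confidence sampling, not the UnBIAS implementation: the `train`, `predict_proba`, and `ask_expert` callables are hypothetical, and the toy one-dimensional threshold "model" only stands in for a real classifier so that the example runs on its own.

```python
def least_confident(probabilities, k):
    """Return indices of the k items whose highest class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:k]


def active_learning_loop(labeled, unlabeled, train, predict_proba, ask_expert,
                         rounds=3, batch_size=2):
    """Train, pick uncertain items, have them labeled, retrain; repeat."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)                                  # preliminary model
    for _ in range(rounds):
        if not pool:
            break
        probs = predict_proba(model, pool)
        picked = set(least_confident(probs, batch_size))    # uncertainty sampling
        labeled.extend((pool[i], ask_expert(pool[i])) for i in sorted(picked))
        pool = [x for i, x in enumerate(pool) if i not in picked]
        model = train(labeled)                              # model refinement
    return model, labeled, pool


# Toy stand-ins (assumptions): items are numbers, the "model" is a threshold.
def train(labeled):
    unbiased = [x for x, y in labeled if y == "unbiased"]
    biased = [x for x, y in labeled if y == "biased"]
    return (max(unbiased) + min(biased)) / 2


def predict_proba(threshold, items):
    probs = []
    for x in items:
        p_biased = min(1.0, max(0.0, 0.5 + (x - threshold)))
        probs.append((1.0 - p_biased, p_biased))
    return probs


def ask_expert(x):
    # Simulated human annotator for the toy data.
    return "biased" if x > 0.5 else "unbiased"


seed = [(0.1, "unbiased"), (0.9, "biased")]
model, labeled, remaining = active_learning_loop(
    seed, [0.2, 0.45, 0.55, 0.8], train, predict_proba, ask_expert,
    rounds=1, batch_size=2,
)
```

In this toy run the two items closest to the decision threshold (0.45 and 0.55) are sent to the simulated expert first, which is the behavior uncertainty sampling is meant to produce.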

Note:
The instructions above apply when you want to train UnBIAS models on your own training data or for a specific use case.
The UnBIAS package also lets you process a batch of sentences to detect and correct bias. Here's how you can do it:
Running the UnBIAS Package
from UnBIAS import run_pipeline_on_texts

# Define your test sentences
test_sentences = [
    "Women are just too emotional to be leaders.",
    "All young people are lazy.",
    "Men are naturally better at sports.",
]

# Run the pipeline; the calls below assume it returns a pandas DataFrame
result_df = run_pipeline_on_texts(test_sentences)

# Preview the first rows, then save the full results to CSV
print(result_df.head())
result_df.to_csv('UnBIAS-results.csv')
Contributions that improve our documentation and methodologies are welcome.