
Utility Functions

Note

These are internal functions used by UnBIAS to help run various parts of the pipeline. You will likely not need to use these.

re_incomplete_sentence(text:str) -> str

Removes incomplete sentences from a text. These can appear when Llama2 stops generating because it has reached its maximum token limit. For instance, given the generated text "The man was hungry. He went to", the function removes the trailing fragment "He went to" and keeps "The man was hungry."

Parameters

text: The text to clean.

Returns

The text with any incomplete sentences removed.

Example

import UnBIAS.unbias as unbias

cleaned_sentence = unbias.re_incomplete_sentence('The man was hungry. He went')
print(cleaned_sentence)

Output

The man was hungry.
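
To make the behaviour concrete, here is a minimal sketch of how this kind of trimming could work. It is not the UnBIAS implementation: the name trim_incomplete_sentence is hypothetical, and it assumes a sentence is complete only if it ends in ".", "!" or "?". It also assumes that text with no sentence-ending mark is returned unchanged.

```python
def trim_incomplete_sentence(text: str) -> str:
    """Drop any trailing fragment that follows the last '.', '!' or '?'.

    Hypothetical helper for illustration; not part of UnBIAS.
    """
    # Position of the last sentence-ending mark, or -1 if none is present.
    last_end = max(text.rfind(mark) for mark in ".!?")
    if last_end == -1:
        # Assumption: with no complete sentence, leave the text untouched.
        return text
    # Keep everything up to and including that mark.
    return text[: last_end + 1]


print(trim_incomplete_sentence("The man was hungry. He went to"))
```

A rule this simple treats every period as a sentence boundary, so a fragment ending in an abbreviation such as "e.g." would be kept as if it were complete.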

tokenize_for_prediction(text:str, tokenizer:PreTrainedTokenizer) -> tuple[list[int], list[int]]

Tokenizes the text and returns the corresponding input_ids and attention_mask.

Parameters

text: The text to tokenize.
tokenizer: A Hugging Face tokenizer, for example one loaded with AutoTokenizer.from_pretrained.

Returns

input_ids: List of integers representing tokenized text.
attention_mask: A list with the same length as input_ids, consisting of 1s and 0s, where 1 indicates a real token, and 0 indicates a padding token. This tells the model which tokens should be attended to.

Example

import UnBIAS.unbias as unbias
from transformers import AutoTokenizer

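# Name of the Hugging Face model whose tokenizer is loaded
model_name = "newsmediabias/UnBIAS-LLama2-Debiaser-Chat-QLoRA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The result holds both the input_ids and the attention_mask
tokenized_text = unbias.tokenize_for_prediction('The man was hungry.', tokenizer)
print(tokenized_text)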

Output

[ints representing token ids], [0s and 1s]
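
The relationship between input_ids and attention_mask is easier to see with padding in play. The sketch below is a toy illustration only, and does not use the UnBIAS or Hugging Face tokenizers. The toy_tokenize function, its whitespace splitting, the tiny vocabulary, and the pad_id and unk_id values are all assumptions made for the example.

```python
def toy_tokenize(
    text: str,
    vocab: dict[str, int],
    max_length: int,
    pad_id: int = 0,
    unk_id: int = 1,
) -> tuple[list[int], list[int]]:
    """Whitespace tokenizer that pads to max_length (illustrative only)."""
    # Map each lowercase word to an id, falling back to unk_id; truncate if too long.
    ids = [vocab.get(token, unk_id) for token in text.lower().split()][:max_length]
    # Real tokens get a 1 in the mask.
    mask = [1] * len(ids)
    # Pad both lists to the same fixed length; padding gets a 0 in the mask.
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


vocab = {"the": 2, "man": 3, "was": 4, "hungry.": 5}
input_ids, attention_mask = toy_tokenize("The man was hungry.", vocab, max_length=6)
print(input_ids)       # [2, 3, 4, 5, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The two lists always have the same length, and every 0 in attention_mask lines up with a padding id in input_ids.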