Utility Functions

Note

These are internal functions that UnBIAS uses to run various parts of the pipeline. You will likely not need to call them directly.

`re_incomplete_sentence(text:str) -> str`

Removes incomplete sentences from a text. These can occur when Llama2 stops generating because it has reached its maximum token generation limit. For instance, if the generated text is `The man was hungry. He went to`, the function removes the trailing fragment `He went to` and keeps `The man was hungry.`

Parameters: text: The text to clean.
Returns: The text with any incomplete sentences removed.

Example

import UnBIAS.unbias as unbias

cleaned_sentence = unbias.re_incomplete_sentence('The man was hungry. He went')
print(cleaned_sentence)

Output

The man was hungry.
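
The behaviour described above can be approximated with a few lines of plain Python. This is a minimal sketch, not the UnBIAS implementation: the name `strip_trailing_fragment` is hypothetical, and it assumes that a sentence is complete only if it ends in `.`, `!`, or `?`, and that a text with no such mark is returned unchanged.

```python
def strip_trailing_fragment(text: str) -> str:
    """Drop everything after the last sentence-ending mark (hypothetical helper)."""
    # Position of the last '.', '!' or '?'; rfind returns -1 when a mark is absent.
    last_end = max(text.rfind(mark) for mark in ".!?")
    if last_end == -1:
        # Assumption: with no complete sentence at all, leave the text untouched.
        return text
    # Keep everything up to and including the final sentence-ending mark.
    return text[: last_end + 1]


print(strip_trailing_fragment("The man was hungry. He went to"))
```

A rule this simple treats the period in an abbreviation such as `Dr.` as the end of a sentence, so it is only a rough illustration of the idea.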

`tokenize_for_prediction(text:str, tokenizer:PreTrainedTokenizer) -> list[int], list[int]`

Tokenizes the text and returns the corresponding input_ids and attention_mask.

Parameters: text: The text to tokenize; tokenizer: A Hugging Face tokenizer.
Returns: input_ids: A list of integers representing the tokenized text; attention_mask: A list of the same length as input_ids, consisting of 1s and 0s, where 1 marks a real token and 0 marks a padding token. It tells the model which tokens to attend to.

Example

import UnBIAS.unbias as unbias
from transformers import AutoTokenizer

model = "newsmediabias/UnBIAS-LLama2-Debiaser-Chat-QLoRA"
tokenizer = AutoTokenizer.from_pretrained(model)
tokenized_text = unbias.tokenize_for_prediction('The man was hungry.', tokenizer)
print(tokenized_text)

Output

[ints representing token ids], [0s and 1s]
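
To make the relationship between input_ids and attention_mask concrete, here is a toy sketch that runs without downloading a model. Everything in it is an assumption for illustration: `toy_tokenize`, the whitespace splitting, the tiny vocabulary, and the `PAD_ID` and `UNK_ID` values are hypothetical. A real Hugging Face tokenizer uses a learned subword vocabulary and its own special tokens.

```python
PAD_ID = 0  # hypothetical id used for padding positions
UNK_ID = 1  # hypothetical id used for words missing from the vocabulary


def toy_tokenize(
    text: str, vocab: dict[str, int], max_length: int
) -> tuple[list[int], list[int]]:
    """Split on whitespace, map words to ids, then pad to max_length."""
    # Truncate to max_length so the output never exceeds the requested size.
    tokens = text.lower().split()[:max_length]
    input_ids = [vocab.get(token, UNK_ID) for token in tokens]
    # Real tokens get a 1 in the mask.
    attention_mask = [1] * len(input_ids)
    # Padding positions get PAD_ID in input_ids and 0 in the mask.
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return input_ids, attention_mask


vocab = {"the": 2, "man": 3, "was": 4, "hungry.": 5}
input_ids, attention_mask = toy_tokenize("The man was hungry.", vocab, max_length=6)
print(input_ids)       # [2, 3, 4, 5, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The two lists always have the same length, and every 0 in the mask lines up with a padding id, which is how a model knows to ignore those positions.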
