Utility Functions
Note
These are internal functions used by UnBIAS to help run various parts of the pipeline. You will likely not need to use these.
re_incomplete_sentence(text:str) -> str
Removes incomplete sentences from a text. These can occur when Llama2 stops generating because it has reached the maximum token generation limit. For instance, the generated text may look like:
The man was hungry. He went to
in which case the function removes "He went to" from the text.
Parameters
- text: The text to clean.
Returns
- The text with any incomplete sentences removed.
Example
import UnBIAS.unbias as unbias
cleaned_sentence = unbias.re_incomplete_sentence('The man was hungry. He went')
print(cleaned_sentence)
Output
The man was hungry.
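To make the behaviour described above concrete, here is a minimal sketch of one way trailing fragments could be dropped. It is not the UnBIAS implementation: the name drop_trailing_fragment is hypothetical, and it assumes a sentence ends at the last ".", "!" or "?" and that text without any terminator is returned unchanged.

```python
def drop_trailing_fragment(text: str) -> str:
    """Hypothetical helper: keep text up to the last sentence terminator."""
    # Position of the right-most '.', '!' or '?' (-1 if none is present).
    last_end = max(text.rfind(mark) for mark in ".!?")
    # Assumption: with no terminator at all, leave the text untouched.
    if last_end == -1:
        return text
    # Slice just past the terminator, discarding the unfinished remainder.
    return text[: last_end + 1]


print(drop_trailing_fragment("The man was hungry. He went to"))
```

A rule this simple would also cut after abbreviations such as "Dr.", so it should be read as an illustration of the idea rather than a robust sentence splitter.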
tokenize_for_prediction(text:str, tokenizer:PreTrainedTokenizer) -> list[int], list[int]
Tokenizes the text and returns the corresponding input_ids and attention_mask.
Parameters
- text: The text to tokenize.
- tokenizer: A Hugging Face Tokenizer.
Returns
- input_ids: List of integers representing tokenized text.
- attention_mask: A list with the same length as input_ids, consisting of 1s and 0s, where 1 indicates a real token, and 0 indicates a padding token. This tells the model which tokens should be attended to.
Example
import UnBIAS.unbias as unbias
from transformers import AutoTokenizer

model = "newsmediabias/UnBIAS-LLama2-Debiaser-Chat-QLoRA"
tokenizer = AutoTokenizer.from_pretrained(model)
tokenized_text = unbias.tokenize_for_prediction('The man was hungry.', tokenizer)
print(tokenized_text)
Output
[ints representing token ids], [0s and 1s]
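The relationship between input_ids and attention_mask can be illustrated without downloading a model. The sketch below is a toy stand-in, not the UnBIAS or Hugging Face code: toy_tokenize, the whitespace splitting, the tiny vocabulary, and the pad id of 0 are all assumptions made for illustration.

```python
def toy_tokenize(text, vocab, max_length, pad_id=0):
    """Hypothetical whitespace tokenizer that pads to max_length."""
    # Map each lower-cased, whitespace-separated token to its id, truncating if needed.
    ids = [vocab[token] for token in text.lower().split()][:max_length]
    # Real tokens get a 1 in the mask.
    mask = [1] * len(ids)
    # Pad both lists to max_length; padding positions get a 0 in the mask.
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Toy vocabulary (assumption): real tokenizers use subword vocabularies.
vocab = {"the": 1, "man": 2, "was": 3, "hungry.": 4}
input_ids, attention_mask = toy_tokenize("The man was hungry.", vocab, max_length=6)
print(input_ids)       # [1, 2, 3, 4, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The two lists always have the same length, and the zeros in attention_mask line up exactly with the padding ids, which is the property the model relies on when deciding which positions to attend to.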