mindformers.models.glm.ChatGLMTokenizer

class mindformers.models.glm.ChatGLMTokenizer(vocab_file, do_lower_case=False, remove_space=False, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', pad_token='<pad>', unk_token='<unk>', num_image_tokens=0, **kwargs)[source]

Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.

Args:

vocab_file (str): The vocabulary file path.
do_lower_case (bool): Whether to lowercase the input text. Default False.
remove_space (bool): Whether to remove spaces from the input text. Default False.
bos_token (str): The token that represents the begin-of-sentence. Default '<sop>'.
eos_token (str): The token that represents the end-of-sentence. Default '<eop>'.
end_token (str): The token that represents the end of context. Default '</s>'.
mask_token (str): The token that represents the special mask. Default '[MASK]'.
gmask_token (str): The token that represents the special mask. Default '[gMASK]'.
pad_token (str): The token that represents the pad. Default '<pad>'.
unk_token (str): The token that represents the unknown. Default '<unk>'.
add_prefix_space (bool): Whether to add a whitespace at the front of the text. Default False.
**kwargs: Other keyword arguments, passed to the base class of the tokenizer.

Examples:

>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm_6b')
>>> tokenize("你好")
{'input_ids': [5, 74874, 130001, 130004], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers.models.glm.chatglm_6b_tokenizer import ChatGLMTokenizer
>>> tokenizer = ChatGLMTokenizer('ice_text.model')
>>> prompts_list = ["晚上睡不着应该怎么办"]
>>> token_id = tokenizer(prompts_list)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[74747, 83400, 64213, 66846, 130001, 130004]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']

Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]

Args:

token_ids_0 (List[int]): List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of input IDs with the appropriate special tokens.
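
The concatenation described above can be sketched in plain Python. This is a minimal illustration of the format shown in this section, not the mindformers implementation: the function name `build_inputs_sketch` and the ids `CLS_ID` and `SEP_ID` are hypothetical placeholders.

```python
from typing import List, Optional

# Hypothetical special-token ids, chosen only for illustration.
CLS_ID = 101
SEP_ID = 102


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Return [CLS] X [SEP] or [CLS] A [SEP] B [SEP] as a flat id list."""
    output = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        # The second sequence is appended with its own trailing separator.
        output += token_ids_1 + [SEP_ID]
    return output


single = build_inputs_sketch([7, 8, 9])
pair = build_inputs_sketch([7, 8], [9])
```

Note that the tokenizer output in the Examples above ends with the ids 130001 and 130004, so the real special tokens and their placement are determined by the ChatGLM vocabulary rather than by this sketch.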

convert_tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]][source]

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Args:

tokens (str or List[str]): One or several token(s) to convert to token id(s).

Returns:

int or List[int]: The token id or list of token ids.
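
The str-or-list behaviour can be sketched with a toy vocabulary. The names `TOY_VOCAB` and `convert_tokens_to_ids_sketch` are hypothetical, and the fallback to the `<unk>` id for missing tokens is an assumption made for the illustration, not a statement about the library.

```python
from typing import Dict, List, Union

# Toy vocabulary; real ids come from the tokenizer's vocabulary file.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "hello": 1, "world": 2}


def convert_tokens_to_ids_sketch(
        tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Map one token to an int, or a list of tokens to a list of ints."""
    # Assumption: tokens missing from the vocabulary map to the <unk> id.
    unk_id = TOY_VOCAB["<unk>"]
    if isinstance(tokens, str):
        return TOY_VOCAB.get(tokens, unk_id)
    return [TOY_VOCAB.get(token, unk_id) for token in tokens]


one_id = convert_tokens_to_ids_sketch("hello")
many_ids = convert_tokens_to_ids_sketch(["hello", "world", "missing"])
```

A single string yields a single integer, while a list yields a list of the same length.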

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]): List of IDs.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of token type IDs according to the given sequence(s).
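
The mask layout above can be sketched as follows. This is an illustration only: `token_type_ids_sketch` is a hypothetical name, and the sketch assumes the `[CLS] A [SEP] B [SEP]` layout described earlier on this page, meaning two special tokens around the first sequence and one after the second.

```python
from typing import List, Optional


def token_type_ids_sketch(token_ids_0: List[int],
                          token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Return 0s for the first segment and 1s for the second segment."""
    # Assumption: the first segment is wrapped as [CLS] A [SEP] (+2 tokens).
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Assumption: the second segment is B [SEP] (+1 token).
    return first + [1] * (len(token_ids_1) + 1)


single_mask = token_type_ids_sketch([7, 8, 9])
pair_mask = token_type_ids_sketch([7, 8], [9])
```

The returned list always has the same length as the corresponding input with special tokens added under the same layout assumption.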

property end_token_id: Optional[int]

Optional[int]: Id of the end of context token in the vocabulary. Returns None if the token has not been set.

get_vocab()[source]

Returns the vocabulary as a dict.

preprocess_text(inputs)[source]

Preprocess text.

save_vocabulary(save_directory, filename_prefix=None)[source]

Save the vocabulary and special tokens file to a directory.

Args:

save_directory (str): The directory in which to save the vocabulary.
filename_prefix (str, optional): An optional prefix to add to the names of the saved files.

Returns:

Tuple(str): Paths to the files saved.
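
How a prefix and a returned tuple of paths might fit together can be sketched as below. Everything here is hypothetical: the function `save_vocabulary_sketch`, the file name `vocab.txt`, and the `<prefix>-<name>` joining rule are assumptions for illustration and do not describe the files ChatGLMTokenizer actually writes.

```python
import os
import tempfile
from typing import Dict, Optional, Tuple


def save_vocabulary_sketch(vocab: Dict[str, int], save_directory: str,
                           filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Write one token per line, ordered by id, and return the saved path."""
    # Hypothetical file name; the "<prefix>-<name>" rule is an assumption.
    name = "vocab.txt"
    if filename_prefix:
        name = f"{filename_prefix}-{name}"
    path = os.path.join(save_directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        for token, _ in sorted(vocab.items(), key=lambda item: item[1]):
            handle.write(token + "\n")
    return (path,)


with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_vocabulary_sketch({"<unk>": 0, "hello": 1}, tmp_dir, "demo")
    saved_name = os.path.basename(saved[0])
    with open(saved[0], encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()
```

Returning a tuple even for a single file keeps the return type stable if more files are saved later.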

tokenize(text, pair=None, add_special_tokens=True, **kwargs)[source]

Returns a tokenized string.

property vocab_size

Returns the vocabulary size.