mindformers.models.glm.ChatGLMTokenizer
- class mindformers.models.glm.ChatGLMTokenizer(vocab_file, do_lower_case=False, remove_space=False, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', pad_token='<pad>', unk_token='<unk>', num_image_tokens=0, **kwargs)[source]
Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.
- Args:
vocab_file(str): The vocabulary file path.
do_lower_case(bool): Whether to lowercase the input text. Default False.
remove_space(bool): Whether to remove spaces from the input text. Default False.
bos_token(str): The token that represents the begin-of-sentence. Default '<sop>'.
eos_token(str): The token that represents the end-of-sentence. Default '<eop>'.
end_token(str): The token that represents the end-of-sentence. Default '</s>'.
mask_token(str): The token that represents the special mask. Default '[MASK]'.
gmask_token(str): The token that represents the special mask. Default '[gMASK]'.
pad_token(str): The token that represents the pad. Default '<pad>'.
unk_token(str): The token that represents the unknown. Default '<unk>'.
num_image_tokens(int): The number of image tokens. Default 0.
add_prefix_space(bool): Whether to add a whitespace in front of the text. Default False.
**kwargs: Other kwargs that will be passed into the base class of the Tokenizer.
- Examples:
>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm_6b')
>>> tokenize("你好")
{'input_ids': [5, 74874, 130001, 130004], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers.models.glm.chatglm_6b_tokenizer import ChatGLMTokenizer
>>> tokenizer = ChatGLMTokenizer('ice_text.model')
>>> prompts_list = ["晚上睡不着应该怎么办"]
>>> token_id = tokenizer(prompts_list)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[74747, 83400, 64213, 66846, 130001, 130004]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']
- Outputs:
A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:
single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]
- Args:
- token_ids_0 (List[int]):
List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of input IDs with the appropriate special tokens.
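The layout described above can be sketched in plain Python. This is a minimal illustration of the documented `[CLS] A [SEP] B [SEP]` format only: `CLS_ID`, `SEP_ID`, and `build_inputs_sketch` are hypothetical names and values, not part of mindformers. The class-level example shows the ChatGLM tokenizer appending two special ids at the end of the sequence, so the real method may place its special tokens differently.

```python
from typing import List, Optional

# Hypothetical ids for illustration only; real values come from the vocabulary.
CLS_ID = 101
SEP_ID = 102


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Concatenate one or two id lists using the [CLS] A [SEP] B [SEP] layout."""
    output = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        # The second sequence is appended and closed with its own separator.
        output = output + token_ids_1 + [SEP_ID]
    return output


single = build_inputs_sketch([7, 8, 9])
pair = build_inputs_sketch([7, 8, 9], [20, 21])
print(single)  # [101, 7, 8, 9, 102]
print(pair)    # [101, 7, 8, 9, 102, 20, 21, 102]
```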
- convert_tokens_to_ids(tokens: Union[str, List[str]]) -> Union[int, List[int]][source]
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
- Args:
tokens (str or List[str]): One or several token(s) to convert to token id(s).
- Returns:
int or List[int]: The token id or list of token ids.
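The str-or-list behaviour can be sketched with a toy vocabulary. `TOY_VOCAB` and `convert_tokens_to_ids_sketch` are hypothetical, and falling back to the unknown-token id for missing tokens is an assumption; the real tokenizer looks tokens up in the vocabulary loaded from `vocab_file`.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; a real tokenizer loads it from vocab_file.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "hello": 1, "world": 2}
UNK_ID = TOY_VOCAB["<unk>"]


def convert_tokens_to_ids_sketch(
        tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Map one token to an int, or a list of tokens to a list of ints."""
    if isinstance(tokens, str):
        return TOY_VOCAB.get(tokens, UNK_ID)
    # Assumption: tokens missing from the vocabulary map to the unknown id.
    return [TOY_VOCAB.get(token, UNK_ID) for token in tokens]


one_id = convert_tokens_to_ids_sketch("hello")
many_ids = convert_tokens_to_ids_sketch(["hello", "world", "missing"])
print(one_id)    # 1
print(many_ids)  # [1, 2, 0]
```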
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int][source]
Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, only the first portion of the mask (0s) is returned.
- Args:
- token_ids_0 (List[int]):
List of ids.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of token type IDs according to the given sequence(s).
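A minimal sketch of the 0/1 mask described above. It assumes the `[CLS] A [SEP] B [SEP]` layout, so the first segment covers two extra special tokens and the second segment covers one; `token_type_ids_sketch` is a hypothetical helper, and the actual ChatGLM implementation may count special tokens differently.

```python
from typing import List, Optional


def token_type_ids_sketch(token_ids_0: List[int],
                          token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Return 0s for the first segment and 1s for the optional second one."""
    # Assumption: [CLS] and [SEP] surround the first sequence (2 extra slots).
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Assumption: the second sequence is followed by one [SEP] (1 extra slot).
    return first + [1] * (len(token_ids_1) + 1)


only_first = token_type_ids_sketch([7, 8, 9])
both = token_type_ids_sketch([7, 8, 9], [20, 21])
print(only_first)  # [0, 0, 0, 0, 0]
print(both)        # [0, 0, 0, 0, 0, 1, 1, 1]
```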
- property end_token_id: Optional[int]
Optional[int]: Id of the end of context token in the vocabulary. Returns None if the token has not been set.
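The "returns None if the token has not been set" contract can be illustrated with a small stand-in class. `EndTokenSketch` and its toy vocabulary are hypothetical and only mirror the documented behaviour; they are not the mindformers implementation.

```python
from typing import Dict, Optional


class EndTokenSketch:
    """Hypothetical stand-in for a tokenizer; not the mindformers class."""

    def __init__(self, vocab: Dict[str, int], end_token: Optional[str] = None):
        self._vocab = vocab
        self._end_token = end_token

    @property
    def end_token_id(self) -> Optional[int]:
        # No end token configured: report None instead of raising.
        if self._end_token is None:
            return None
        return self._vocab.get(self._end_token)


toy_vocab = {"</s>": 3}  # hypothetical id
with_token = EndTokenSketch(toy_vocab, end_token="</s>")
without_token = EndTokenSketch(toy_vocab)
print(with_token.end_token_id)     # 3
print(without_token.end_token_id)  # None
```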
- save_vocabulary(save_directory, filename_prefix=None)[source]
Save the vocabulary and special tokens file to a directory.
- Args:
- save_directory (str):
The directory in which to save the vocabulary.
- filename_prefix (str, optional):
An optional prefix to add to the names of the saved files.
- Returns:
Tuple(str): Paths to the files saved.
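How `save_directory` and `filename_prefix` might combine into the returned paths is sketched below. The file name `vocab.model`, the hyphen used to join the prefix, and the helper `save_vocabulary_sketch` are all assumptions made for illustration; the real method chooses its own file names.

```python
import os
import tempfile
from typing import Optional, Tuple

# Hypothetical file name; the real tokenizer decides its own file name.
VOCAB_FILE_NAME = "vocab.model"


def save_vocabulary_sketch(vocab_bytes: bytes,
                           save_directory: str,
                           filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Write the vocabulary into save_directory and return the saved path."""
    os.makedirs(save_directory, exist_ok=True)
    # Assumption: the prefix is joined to the file name with a hyphen.
    if filename_prefix:
        name = f"{filename_prefix}-{VOCAB_FILE_NAME}"
    else:
        name = VOCAB_FILE_NAME
    path = os.path.join(save_directory, name)
    with open(path, "wb") as handle:
        handle.write(vocab_bytes)
    return (path,)


with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_vocabulary_sketch(b"toy vocabulary", tmp_dir,
                                   filename_prefix="demo")
    saved_name = os.path.basename(saved[0])
    saved_exists = os.path.exists(saved[0])
print(saved_name)    # demo-vocab.model
print(saved_exists)  # True
```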
- property vocab_size
Returns the vocabulary size.