mindformers.models.glm.ChatGLMTokenizer
-
class
mindformers.models.glm.ChatGLMTokenizer(vocab_file, do_lower_case=False, remove_space=False, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', padding_side='left', pad_token='<pad>', unk_token='<unk>', num_image_tokens=0, **kwargs)
Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.
- Parameters
vocab_file (str) – The vocabulary file path.
do_lower_case (bool) – Whether to lowercase the input text. Default False.
remove_space (bool) – Whether to remove spaces from the input text. Default False.
bos_token (str) – The token that represents the beginning of a sentence. Default '<sop>'.
eos_token (str) – The token that represents the end of a sentence. Default '<eop>'.
end_token (str) – The token that represents the end of context. Default '</s>'.
mask_token (str) – The token that represents the special mask. Default '[MASK]'.
gmask_token (str) – The token that represents the special generation mask. Default '[gMASK]'.
padding_side (str) – The side on which padding is applied. Default 'left'.
pad_token (str) – The token used for padding. Default '<pad>'.
unk_token (str) – The token that represents an unknown token. Default '<unk>'.
add_prefix_space (bool) – Whether to add a whitespace at the front of the text. Default False.
**kwargs – Other keyword arguments that are passed to the base class of the tokenizer.
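The effect of padding_side='left' can be illustrated with a minimal, library-independent sketch. The helper pad_batch and the pad id 0 are hypothetical and are not part of mindformers; the sketch only shows on which side the pad ids end up.

```python
from typing import List


def pad_batch(batch: List[List[int]], pad_id: int, padding_side: str = "left") -> List[List[int]]:
    """Pad every sequence in the batch to the length of the longest one (illustrative helper)."""
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        padding = [pad_id] * (max_len - len(seq))
        # Left padding keeps the real tokens at the end of each sequence.
        padded.append(padding + seq if padding_side == "left" else seq + padding)
    return padded


batch = [[11, 12, 13], [21]]
left = pad_batch(batch, pad_id=0, padding_side="left")
right = pad_batch(batch, pad_id=0, padding_side="right")
print(left)   # [[11, 12, 13], [0, 0, 21]]
print(right)  # [[11, 12, 13], [21, 0, 0]]
```

Left padding is a common choice for generation models, because the newest tokens then sit at the same position in every row of the batch.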
Examples
>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm_6b')
>>> tokenize("你好")
{'input_ids': [5, 74874, 130001, 130004], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers.models.glm.chatglm_6b_tokenizer import ChatGLMTokenizer
>>> tokenizer = ChatGLMTokenizer('ice_text.model')
>>> prompts_list = ["晚上睡不着应该怎么办"]
>>> token_id = tokenizer(prompts_list)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[74747, 83400, 64213, 66846, 130001, 130004]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']
- Outputs:
A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
-
build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:
single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]
- Parameters
token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns
List of input IDs with the appropriate special tokens.
- Return type
List[int]
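The two BERT-style layouts described above can be sketched in plain Python. This is only an illustration of the documented format, not the mindformers implementation: the function name and the ids 101 and 102 are placeholders. The example output earlier on this page ends with the ids 130001 and 130004, which suggests that the ChatGLM tokenizer appends its own special tokens rather than [CLS] and [SEP].

```python
from typing import List, Optional

CLS_ID = 101  # placeholder id standing in for [CLS]
SEP_ID = 102  # placeholder id standing in for [SEP]


def build_inputs_sketch(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Concatenate one or two id lists and add the special tokens (illustrative only)."""
    # Single sequence: [CLS] X [SEP]
    single = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is None:
        return single
    # Pair of sequences: [CLS] A [SEP] B [SEP]
    return single + token_ids_1 + [SEP_ID]


single_ids = build_inputs_sketch([7, 8])
pair_ids = build_inputs_sketch([7, 8], [9])
print(single_ids)  # [101, 7, 8, 102]
print(pair_ids)    # [101, 7, 8, 102, 9, 102]
```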
-
convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)
Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.
- Parameters
ids (int or List[int]) – The token id (or token ids) to convert to tokens.
skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.
- Returns
The decoded token(s).
- Return type
str or List[str]
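A minimal sketch of the id-to-token lookup, assuming a toy four-entry vocabulary. The names ID_TO_TOKEN, SPECIAL_IDS and ids_to_tokens_sketch are hypothetical; the real tokenizer reads its vocabulary from vocab_file.

```python
from typing import List, Union

ID_TO_TOKEN = {0: "<pad>", 1: "<unk>", 2: "hello", 3: "world"}  # toy vocabulary
SPECIAL_IDS = {0, 1}  # ids treated as special tokens in this sketch


def ids_to_tokens_sketch(ids: Union[int, List[int]], skip_special_tokens: bool = False) -> Union[str, List[str]]:
    """Map a single id to a token, or a list of ids to a list of tokens."""
    if isinstance(ids, int):
        return ID_TO_TOKEN[ids]
    tokens = []
    for index in ids:
        # Drop special tokens only when the caller asks for it.
        if skip_special_tokens and index in SPECIAL_IDS:
            continue
        tokens.append(ID_TO_TOKEN[index])
    return tokens


print(ids_to_tokens_sketch(2))                                    # hello
print(ids_to_tokens_sketch([0, 2, 3]))                            # ['<pad>', 'hello', 'world']
print(ids_to_tokens_sketch([0, 2, 3], skip_special_tokens=True))  # ['hello', 'world']
```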
-
convert_tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]]
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
- Parameters
tokens (str or List[str]) – One or several token(s) to convert to token id(s).
- Returns
The token id or list of token ids.
- Return type
int or List[int]
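The str-versus-list behaviour of the return type can be sketched as follows. The toy vocabulary and the function name are hypothetical, and mapping out-of-vocabulary tokens to the id of '<unk>' is an assumption made for this sketch rather than documented behaviour.

```python
from typing import List, Union

TOKEN_TO_ID = {"<unk>": 1, "hello": 2, "world": 3}  # toy vocabulary
UNK_ID = TOKEN_TO_ID["<unk>"]


def tokens_to_ids_sketch(tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Return one id for a single token, or a list of ids for a list of tokens."""
    if isinstance(tokens, str):
        # Assumption: unknown tokens fall back to the <unk> id.
        return TOKEN_TO_ID.get(tokens, UNK_ID)
    return [TOKEN_TO_ID.get(token, UNK_ID) for token in tokens]


print(tokens_to_ids_sketch("hello"))             # 2
print(tokens_to_ids_sketch(["hello", "moon"]))   # [2, 1]
```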
-
property
end_token_id
Id of the end of context token in the vocabulary. Returns None if the token has not been set.
- Type
Optional[int]
-
save_vocabulary(save_directory, filename_prefix=None)
Save the vocabulary and special tokens file to a directory.
- Parameters
save_directory (str) – The directory in which to save the vocabulary.
filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.
- Returns
Paths to the files saved.
- Return type
Tuple(str)
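How save_directory and filename_prefix might combine into the returned paths can be sketched with the standard library. The file name vocab.model, the "prefix-name" joining rule, and the function save_vocabulary_sketch are assumptions for illustration; the actual file names written by ChatGLMTokenizer are not stated on this page.

```python
import os
import tempfile
from typing import Optional, Tuple

VOCAB_FILE_NAME = "vocab.model"  # hypothetical file name, used only in this sketch


def save_vocabulary_sketch(vocab_bytes: bytes, save_directory: str,
                           filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Write the vocabulary bytes into save_directory and return the saved paths."""
    os.makedirs(save_directory, exist_ok=True)
    # Assumption: the prefix, if given, is joined to the file name with a hyphen.
    file_name = f"{filename_prefix}-{VOCAB_FILE_NAME}" if filename_prefix else VOCAB_FILE_NAME
    path = os.path.join(save_directory, file_name)
    with open(path, "wb") as handle:
        handle.write(vocab_bytes)
    return (path,)


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_paths = save_vocabulary_sketch(b"toy-vocab", tmp_dir, filename_prefix="demo")
    saved_name = os.path.basename(saved_paths[0])
    saved_exists = os.path.exists(saved_paths[0])

print(saved_name)  # demo-vocab.model
```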
-
property
vocab_size
Returns the size of the vocabulary.