mindformers.models.glm.ChatGLMTokenizer

class mindformers.models.glm.ChatGLMTokenizer(vocab_file, do_lower_case=False, remove_space=False, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', padding_side='left', pad_token='<pad>', unk_token='<unk>', num_image_tokens=0, **kwargs)[source]

Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.

Parameters
  • vocab_file (str) – The vocabulary file path.

  • do_lower_case (bool) – Whether to lowercase the input text. Default False.

  • remove_space (bool) – Whether to remove spaces from the input text. Default False.

  • bos_token (str) – The token that represents the begin-of-sentence. Default ‘<sop>’.

  • eos_token (str) – The token that represents the end-of-sentence. Default ‘<eop>’.

  • end_token (str) – The token that represents the end of context. Default ‘</s>’.

  • mask_token (str) – The token that represents the special mask. Default ‘[MASK]’.

  • gmask_token (str) – The token that represents the special generation mask. Default ‘[gMASK]’.

  • pad_token (str) – The token that represents the pad. Default ‘<pad>’.

  • unk_token (str) – The token that represents the unknown. Default ‘<unk>’.

  • add_prefix_space (bool) – Whether to add a whitespace at the front of the text. Default False.

  • **kwargs – Other kwargs that will be passed into the base class of the Tokenizer.

Examples

>>> from mindformers import AutoTokenizer
>>> tokenize = AutoTokenizer.from_pretrained('glm_6b')
>>> tokenize("你好")
{'input_ids': [5, 74874, 130001, 130004], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> from mindformers.models.glm.chatglm_6b_tokenizer import ChatGLMTokenizer
>>> tokenizer = ChatGLMTokenizer('ice_text.model')
>>> prompts = ["晚上睡不着应该怎么办"]  # "What should I do if I can't sleep at night?"
>>> token_id = tokenizer(prompts)
>>> input_ids = token_id['input_ids']
>>> print(input_ids)
[[74747, 83400, 64213, 66846, 130001, 130004]]
>>> response = tokenizer.decode(input_ids)
>>> print(response)
['晚上睡不着应该怎么办']
Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:

  • single sequence: [CLS] X [SEP]

  • pair of sequences: [CLS] A [SEP] B [SEP]

Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of input IDs with the appropriate special tokens.

Return type

List[int]
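
The layout described for this method can be sketched in plain Python. This is only an illustration of the `[CLS] X [SEP]` pattern, not the mindformers implementation: `build_inputs_sketch`, `CLS_ID` and `SEP_ID` are hypothetical names and ids, and the special tokens that ChatGLM actually appends may differ.

```python
from typing import List, Optional

# Hypothetical special-token ids, chosen only for illustration.
CLS_ID = 101
SEP_ID = 102


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Concatenate one or two id lists with special tokens (BERT-style layout)."""
    ids = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        # A second sequence is appended and closed with its own separator.
        ids += token_ids_1 + [SEP_ID]
    return ids


single = build_inputs_sketch([7, 8, 9])
pair = build_inputs_sketch([7, 8], [9])
print(single)  # [101, 7, 8, 9, 102]
print(pair)    # [101, 7, 8, 102, 9, 102]
```

The input lists are not modified; a new list is returned in both cases.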

convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)[source]

Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.

Parameters
  • ids (int or List[int]) – The token id (or token ids) to convert to tokens.

  • skip_special_tokens (bool, optional, defaults to False) – Whether or not to remove special tokens in the decoding.

Returns

The decoded token(s).

Return type

str or List[str]
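
The effect of `skip_special_tokens` can be illustrated with a toy lookup table. This is a sketch under stated assumptions, not the library's code: `ID_TO_TOKEN`, `SPECIAL_IDS` and `ids_to_tokens_sketch` are hypothetical, and the real ids come from the tokenizer's vocabulary file.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; real ids come from the tokenizer's vocab file.
ID_TO_TOKEN: Dict[int, str] = {0: "<pad>", 1: "<unk>", 2: "hello", 3: "world"}
SPECIAL_IDS = {0, 1}


def ids_to_tokens_sketch(ids: Union[int, List[int]],
                         skip_special_tokens: bool = False) -> Union[str, List[str]]:
    """Map one id to a token, or a list of ids to a list of tokens."""
    if isinstance(ids, int):
        return ID_TO_TOKEN[ids]
    tokens = []
    for index in ids:
        if skip_special_tokens and index in SPECIAL_IDS:
            continue  # drop special tokens when requested
        tokens.append(ID_TO_TOKEN[index])
    return tokens


print(ids_to_tokens_sketch(2))                                    # hello
print(ids_to_tokens_sketch([0, 2, 3]))                            # ['<pad>', 'hello', 'world']
print(ids_to_tokens_sketch([0, 2, 3], skip_special_tokens=True))  # ['hello', 'world']
```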

convert_tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]][source]

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Parameters

tokens (str or List[str]) – One or several token(s) to convert to token id(s).

Returns

The token id or list of token ids.

Return type

int or List[int]
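
The reverse lookup can be sketched the same way. The toy vocabulary and the function name `tokens_to_ids_sketch` are hypothetical, and mapping out-of-vocabulary tokens to the `<unk>` id is an assumption made for this illustration rather than documented behaviour of this class.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary, used only for this illustration.
TOKEN_TO_ID: Dict[str, int] = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}
UNK_ID = TOKEN_TO_ID["<unk>"]


def tokens_to_ids_sketch(tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Map one token to an id, or a list of tokens to a list of ids."""
    if isinstance(tokens, str):
        # Assumption: unknown tokens fall back to the <unk> id.
        return TOKEN_TO_ID.get(tokens, UNK_ID)
    return [TOKEN_TO_ID.get(token, UNK_ID) for token in tokens]


print(tokens_to_ids_sketch("hello"))               # 2
print(tokens_to_ids_sketch(["hello", "missing"]))  # [2, 1]
```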

property end_token_id

Id of the end of context token in the vocabulary. Returns None if the token has not been set.

Type

Optional[int]

get_vocab()[source]

Returns the vocabulary as a dict.

preprocess_text(inputs)[source]

Preprocess text.

save_vocabulary(save_directory, filename_prefix=None)[source]

Save the vocabulary and special tokens file to a directory.

Parameters
  • save_directory (str) – The directory in which to save the vocabulary.

  • filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.

Returns

Paths to the files saved.

Return type

Tuple(str)
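
How `save_directory` and `filename_prefix` might combine into the returned paths can be sketched as follows. The file name `vocab.txt`, the `prefix-name` joining rule and the one-token-per-line format are assumptions made for illustration; the actual tokenizer may write a different file under a different name.

```python
import os
import tempfile
from typing import Dict, Optional, Tuple


def save_vocabulary_sketch(vocab: Dict[str, int],
                           save_directory: str,
                           filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Write one token per line, ordered by id, and return the saved path(s)."""
    os.makedirs(save_directory, exist_ok=True)
    name = "vocab.txt"  # hypothetical file name
    if filename_prefix:
        name = f"{filename_prefix}-{name}"  # assumed joining convention
    path = os.path.join(save_directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        for token, _ in sorted(vocab.items(), key=lambda item: item[1]):
            handle.write(token + "\n")
    return (path,)


with tempfile.TemporaryDirectory() as tmp:
    saved = save_vocabulary_sketch({"hello": 1, "<pad>": 0}, tmp, "demo")
    saved_name = os.path.basename(saved[0])
    with open(saved[0], encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()

print(saved_name)   # demo-vocab.txt
print(saved_lines)  # ['<pad>', 'hello']
```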

tokenize(text)[source]

Returns a tokenized string.

property vocab_size

Returns the vocabulary size.