mindformers.models.bert.BertTokenizer
- class mindformers.models.bert.BertTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, is_tokenize_char=False, **kwargs)
Construct a BERT tokenizer. Based on WordPiece.
This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
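To illustrate what "based on WordPiece" means, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece is generally described as using. It is not the MindFormers implementation: the function name `wordpiece_tokenize`, the `max_chars` limit, and the toy vocabulary are assumptions made for this example only.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into sub-tokens by greedy longest-match-first lookup.

    Hypothetical helper for illustration; not part of the mindformers API.
    """
    if len(word) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word maps to unk_token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption); a real one is loaded from vocab_file.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word with no matching piece at some position collapses to the single `unk_token`, which is consistent with the `unk_token` argument described in the Args list.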
- Args:
- vocab_file (str):
File containing the vocabulary.
- do_lower_case (bool, optional, defaults to True):
Whether or not to lowercase the input when tokenizing.
- do_basic_tokenize (bool, optional, defaults to True):
Whether or not to do basic tokenization before WordPiece.
- never_split (Iterable, optional):
Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
- unk_token (str, optional, defaults to “[UNK]”):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- sep_token (str, optional, defaults to “[SEP]”):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- pad_token (str, optional, defaults to “[PAD]”):
The token used for padding, for example when batching sequences of different lengths.
- cls_token (str, optional, defaults to “[CLS]”):
The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- mask_token (str, optional, defaults to “[MASK]”):
The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
- tokenize_chinese_chars (bool, optional, defaults to True):
Whether or not to tokenize Chinese characters.
This should likely be deactivated for Japanese (see this [issue](https://github.com/huggingface/transformers/issues/328)).
- is_tokenize_char (bool, optional, defaults to False):
Whether or not to tokenize characters.
- strip_accents (bool, optional):
Whether or not to strip all accents. If this option is not specified, it is determined by the value of do_lower_case (as in the original BERT).
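The interplay between do_lower_case, strip_accents, and never_split can be sketched as follows. This is an illustrative approximation of the basic-tokenization step as described in the Args list, not the library's code; `normalize_token` and `strip_accents_from` are hypothetical helper names.

```python
import unicodedata


def strip_accents_from(text):
    """Drop combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_token(token, do_lower_case=True, strip_accents=None, never_split=()):
    """Apply lowercasing and accent stripping to a single token (sketch)."""
    # Tokens listed in never_split are passed through untouched.
    if token in never_split:
        return token
    if do_lower_case:
        token = token.lower()
    # When strip_accents is unset, it follows do_lower_case.
    if strip_accents is None:
        strip_accents = do_lower_case
    if strip_accents:
        token = strip_accents_from(token)
    return token


print(normalize_token("Caf\u00e9"))
print(normalize_token("Caf\u00e9", do_lower_case=False, strip_accents=True))
print(normalize_token("[MASK]", never_split={"[MASK]"}))
```

With the defaults, accents are stripped because do_lower_case is True; passing do_lower_case=False without setting strip_accents leaves accented characters in place.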
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. A BERT sequence has the following format:
single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]
- Args:
- token_ids_0 (List[int]):
List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
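The two formats above can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: `sketch_build_inputs` is a hypothetical stand-in for the method, and the IDs 101 and 102 for [CLS] and [SEP] are placeholders, since the real values depend on the vocabulary file.

```python
from typing import List, Optional

CLS_ID = 101  # assumed ID of [CLS]; the real value comes from the vocabulary
SEP_ID = 102  # assumed ID of [SEP]; the real value comes from the vocabulary


def sketch_build_inputs(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Wrap one or two ID lists as [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    output = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:
        output += token_ids_1 + [SEP_ID]
    return output


print(sketch_build_inputs([7, 8, 9]))
print(sketch_build_inputs([7, 8, 9], [20, 21]))
```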
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, this method only returns the first portion of the mask (0s).
- Args:
- token_ids_0 (List[int]):
List of IDs.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
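As a rough illustration of the mask layout, the sketch below assumes the input is laid out as [CLS] A [SEP] B [SEP], so the first segment covers two extra positions and the second covers one. `sketch_token_type_ids` is a hypothetical name, and the real method may differ in details.

```python
from typing import List, Optional


def sketch_token_type_ids(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return 0 for every position of the first segment and 1 for the second."""
    # First segment: [CLS] + A + [SEP]
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Second segment: B + [SEP]
    return first + [1] * (len(token_ids_1) + 1)


print(sketch_token_type_ids([7, 8, 9]))
print(sketch_token_type_ids([7, 8, 9], [20, 21]))
```

The returned list has the same length as the input produced for the same arguments with special tokens added, so the two can be passed to a model side by side.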
- get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) -> List[int]
Build a mask that marks the positions of special tokens in the model input built from the given token list(s). This method is called when adding special tokens through the tokenizer's prepare_for_model method.
- Args:
- token_ids_0 (List[int]):
List of IDs.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False):
Whether or not the token list is already formatted with special tokens for the model.
- Returns:
List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
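The behavior described above can be sketched in plain Python. This is an assumption-laden illustration, not the library's code: `sketch_special_tokens_mask` is a hypothetical name, the `special_ids` parameter and its placeholder IDs 101 and 102 are invented for the example, and the layout is assumed to be [CLS] A [SEP] B [SEP].

```python
from typing import List, Optional, Sequence


def sketch_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    special_ids: Sequence[int] = (101, 102),  # assumed [CLS] and [SEP] IDs
) -> List[int]:
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if already_has_special_tokens:
        # The list is already formatted: mark each ID that is a special token.
        return [1 if token_id in special_ids else 0 for token_id in token_ids_0]
    # Otherwise describe the layout that adding special tokens would produce.
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(sketch_special_tokens_mask([7, 8, 9]))
print(sketch_special_tokens_mask([7, 8, 9], [20, 21]))
print(sketch_special_tokens_mask([101, 7, 8, 102], already_has_special_tokens=True))
```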