mindformers.dataset.CausalLanguageModelDataset
class mindformers.dataset.CausalLanguageModelDataset(dataset_config: dict = None)
Causal language model pretraining dataset.
Examples
>>> from mindformers.tools.register import MindFormerConfig
>>> from mindformers import MindFormerBook
>>> from mindformers.dataset import CausalLanguageModelDataset
>>> from mindformers.dataset import build_dataset, check_dataset_config
>>> config_dict_list = MindFormerBook.get_trainer_support_task_list()
>>> config_path = config_dict_list['text_generation']['gpt2']
>>> # Initialize a MindFormerConfig instance from a specific YAML config file.
>>> config = MindFormerConfig(config_path)
>>> config.train_dataset.data_loader.dataset_dir = "The required task dataset path"
>>> # Note: for detailed data settings, refer to
>>> # https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/gpt2.md
>>> check_dataset_config(config)
>>> # 1) Build the dataset from a config dict.
>>> dataset_from_config = build_dataset(config.train_dataset_task)
>>> # 2) Build the dataset by class name.
>>> dataset_from_name = build_dataset(class_name='CausalLanguageModelDataset',
...                                   dataset_config=config.train_dataset_task.dataset_config)
>>> # 3) Build the dataset by instantiating the class directly.
>>> dataset_from_class = CausalLanguageModelDataset(config.train_dataset_task.dataset_config)
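The three construction routes in the example (from a config dict, by class name, and by direct instantiation) follow a common registry pattern. The sketch below illustrates that pattern in plain Python so it can be run without MindFormers installed. It is not the MindFormers implementation: `_REGISTRY`, `register`, `build`, `ToyDataset`, and the use of a `type` key to name the class are all hypothetical names chosen for illustration.

```python
# Hypothetical stand-in for a class registry; NOT the MindFormers implementation.
_REGISTRY = {}


def register(cls):
    """Record a class under its own name so it can be looked up by string."""
    _REGISTRY[cls.__name__] = cls
    return cls


@register
class ToyDataset:
    """Toy dataset that only stores the config it was given."""

    def __init__(self, dataset_config=None):
        self.dataset_config = dataset_config or {}


def build(config=None, class_name=None, **kwargs):
    """Build a registered object from a config dict or from a class name."""
    if config is not None:
        # Config route: 'type' (an assumed key) names the class,
        # and the remaining keys are passed as constructor arguments.
        params = dict(config)
        return _REGISTRY[params.pop("type")](**params)
    # Name route: look the class up and pass keyword arguments through.
    return _REGISTRY[class_name](**kwargs)


settings = {"batch_size": 8}  # illustrative value only

# 1) Build from a config dict.
from_config = build({"type": "ToyDataset", "dataset_config": settings})
# 2) Build by class name.
from_name = build(class_name="ToyDataset", dataset_config=settings)
# 3) Build by instantiating the class directly.
from_class = ToyDataset(settings)
```

All three objects end up with the same settings, which is why the routes in the example above are interchangeable; the config and name routes mainly help when the class is selected from a YAML file rather than hard-coded.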