[fairseq] tutorial

Author: VanJordan | Published 2019-04-23 13:28

    Let's start by working through the official LSTM tutorial.

    • Build an encoder and a decoder, inheriting from FairseqEncoder and FairseqDecoder respectively.
    • Besides the basic __init__ and forward methods, the encoder must also implement reorder_encoder_out, which reorders the encoder output within a batch according to a new order (a minimal encoder sketch is given after the snippet below):
    def reorder_encoder_out(self, encoder_out, new_order):
        """
        Reorder encoder output according to `new_order`.
    
        Args:
            encoder_out: output from the ``forward()`` method
            new_order (LongTensor): desired order
    
        Returns:
            `encoder_out` rearranged according to `new_order`
        """
        final_hidden = encoder_out['final_hidden']
        return {
            'final_hidden': final_hidden.index_select(0, new_order),
        }
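
    • For reference, a minimal sketch of what the encoder's __init__ and forward could look like. It follows the constructor signature used in build_model below, but omits details of the official tutorial code (such as packing the padded sequence), so treat it as an illustration rather than the exact tutorial implementation:
    import torch.nn as nn

    from fairseq.models import FairseqEncoder


    class SimpleLSTMEncoder(FairseqEncoder):

        def __init__(self, args, dictionary, embed_dim=128, hidden_dim=128, dropout=0.1):
            super().__init__(dictionary)
            # Embedding table for the source vocabulary; the pad index gets its own slot.
            self.embed_tokens = nn.Embedding(
                num_embeddings=len(dictionary),
                embedding_dim=embed_dim,
                padding_idx=dictionary.pad(),
            )
            self.dropout = nn.Dropout(p=dropout)
            self.lstm = nn.LSTM(
                input_size=embed_dim,
                hidden_size=hidden_dim,
                num_layers=1,
                batch_first=True,
            )

        def forward(self, src_tokens, src_lengths):
            # Embed the source tokens and apply dropout.
            x = self.dropout(self.embed_tokens(src_tokens))
            # The final hidden state summarizes the whole source sentence.
            _outputs, (final_hidden, _final_cell) = self.lstm(x)
            # Return a dict so that ``reorder_encoder_out`` above can reorder it.
            return {
                'final_hidden': final_hidden.squeeze(0),  # (batch, hidden_dim)
            }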
    
    • Register the model with register_model:
    from fairseq.models import FairseqModel, register_model
    
    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.
    
    @register_model('simple_lstm')
    class SimpleLSTMModel(FairseqModel):
    
    • Every registered model must implement the BaseFairseqModel interface; for sequence-to-sequence models, we implement the FairseqModel interface.
    • Implement the add_args and build_model methods:
    from fairseq.models import FairseqModel, register_model
    
    # Note: the register_model "decorator" should immediately precede the
    # definition of the Model class.
    
    @register_model('simple_lstm')
    class SimpleLSTMModel(FairseqModel):
    
        @staticmethod
        def add_args(parser):
            # Models can override this method to add new command-line arguments.
            # Here we'll add some new command-line arguments to configure dropout
            # and the dimensionality of the embeddings and hidden states.
            parser.add_argument(
                '--encoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the encoder embeddings',
            )
            parser.add_argument(
                '--encoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the encoder hidden state',
            )
            parser.add_argument(
                '--encoder-dropout', type=float, default=0.1,
                help='encoder dropout probability',
            )
            parser.add_argument(
                '--decoder-embed-dim', type=int, metavar='N',
                help='dimensionality of the decoder embeddings',
            )
            parser.add_argument(
                '--decoder-hidden-dim', type=int, metavar='N',
                help='dimensionality of the decoder hidden state',
            )
            parser.add_argument(
                '--decoder-dropout', type=float, default=0.1,
                help='decoder dropout probability',
            )
    
        @classmethod
        def build_model(cls, args, task):
            # Fairseq initializes models by calling the ``build_model()``
            # function. This provides more flexibility, since the returned model
            # instance can be of a different type than the one that was called.
            # In this case we'll just return a SimpleLSTMModel instance.
    
            # Initialize our Encoder and Decoder.
            encoder = SimpleLSTMEncoder(
                args=args,
                dictionary=task.source_dictionary,
                embed_dim=args.encoder_embed_dim,
                hidden_dim=args.encoder_hidden_dim,
                dropout=args.encoder_dropout,
            )
            decoder = SimpleLSTMDecoder(
                dictionary=task.target_dictionary,
                encoder_hidden_dim=args.encoder_hidden_dim,
                embed_dim=args.decoder_embed_dim,
                hidden_dim=args.decoder_hidden_dim,
                dropout=args.decoder_dropout,
            )
            model = SimpleLSTMModel(encoder, decoder)
    
            # Print the model architecture.
            print(model)
    
            return model
    
        # We could override the ``forward()`` if we wanted more control over how
        # the encoder and decoder interact, but it's not necessary for this
        # tutorial since we can inherit the default implementation provided by
        # the FairseqModel base class, which looks like:
        #
        # def forward(self, src_tokens, src_lengths, prev_output_tokens):
        #     encoder_out = self.encoder(src_tokens, src_lengths)
        #     decoder_out = self.decoder(prev_output_tokens, encoder_out)
        #     return decoder_out
    
    • Finally, register the architecture with register_model_architecture(). The first argument is the name of the model registered above and the second is the name of this architecture, which can then be selected on the command line with -a/--arch tutorial_simple_lstm.
    • getattr() gives priority to a value that has already been set (e.g., on the command line); only if encoder_embed_dim is absent is the default 256 used (see the short demonstration after the code below).
    • Parameters placed here generally do not need to be changed; a parameter that has to be tuned to reach good results should be exposed in the model's add_args instead of being set here via getattr.
    • The setup_task function loads the data and returns a task instance.
    from fairseq.models import register_model_architecture
    
    # The first argument to ``register_model_architecture()`` should be the name
    # of the model we registered above (i.e., 'simple_lstm'). The function we
    # register here should take a single argument *args* and modify it in-place
    # to match the desired architecture.
    
    @register_model_architecture('simple_lstm', 'tutorial_simple_lstm')
    def tutorial_simple_lstm(args):
        # We use ``getattr()`` to prioritize arguments that are explicitly given
        # on the command-line, so that the defaults defined below are only used
        # when no other value has been specified.
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
        args.encoder_hidden_dim = getattr(args, 'encoder_hidden_dim', 256)
        args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 256)
        args.decoder_hidden_dim = getattr(args, 'decoder_hidden_dim', 256)
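
    • As a quick standalone demonstration of that precedence (plain Python, not fairseq-specific):
    from argparse import Namespace

    # Value explicitly set (e.g. via --encoder-embed-dim 512): getattr keeps it.
    args = Namespace(encoder_embed_dim=512)
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
    print(args.encoder_embed_dim)  # 512

    # Attribute absent (flag not given): getattr falls back to the default.
    args = Namespace()
    args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 256)
    print(args.encoder_embed_dim)  # 256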
    
    • Train from the command line:
    fairseq-train data-bin/iwslt14.tokenized.de-en \
      --arch tutorial_simple_lstm \
      --encoder-dropout 0.2 --decoder-dropout 0.2 \
      --optimizer adam --lr 0.005 --lr-shrink 0.5 \
      --max-tokens 12000
    
    • An interesting detail: tensors created with register_buffer are moved to the GPU automatically when .cuda() is called (a standalone sketch follows the snippet below):
    import torch
    from fairseq.models import BaseFairseqModel

    class FairseqRNNClassifier(BaseFairseqModel):
        def __init__(self, rnn, input_vocab):
            super().__init__()
            self.rnn = rnn
            self.input_vocab = input_vocab
            # The RNN module in the tutorial expects one-hot inputs, so we can
            # precompute the identity matrix to help convert from indices to
            # one-hot vectors. We register it as a buffer so that it is moved to
            # the GPU when ``cuda()`` is called.
            self.register_buffer('one_hot_inputs', torch.eye(len(input_vocab)))
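
    • A standalone sketch (the OneHotLookup module below is hypothetical, not part of the tutorial) showing why this is convenient: indexing the identity matrix converts token indices into one-hot vectors, and the buffer follows the module when it is moved to another device:
    import torch
    import torch.nn as nn


    class OneHotLookup(nn.Module):

        def __init__(self, vocab_size):
            super().__init__()
            # Buffers are stored in the state dict and moved by .to()/.cuda(),
            # but are not returned by .parameters(), so they are never trained.
            self.register_buffer('one_hot_inputs', torch.eye(vocab_size))

        def forward(self, token_ids):
            # Row i of the identity matrix is the one-hot vector for index i.
            return self.one_hot_inputs[token_ids]


    m = OneHotLookup(vocab_size=5)
    print(m(torch.tensor([0, 3])))  # two one-hot rows of length 5
    # m.cuda() would move ``one_hot_inputs`` to the GPU along with the module.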
    
    

    task

    • As with t2t, the most important piece is the task: the task loads the data, passes it to the model, and collects the returned results.
    • In load_dataset, split indicates which stage is being loaded: train, valid, or test.
    • Note the complete task implementation:
    import os
    import torch
    
    from fairseq.data import Dictionary, LanguagePairDataset
    from fairseq.tasks import FairseqTask, register_task
    
    
    @register_task('simple_classification')
    class SimpleClassificationTask(FairseqTask):
    
        @staticmethod
        def add_args(parser):
            # Add some command-line arguments for specifying where the data is
            # located and the maximum supported input length.
            parser.add_argument('data', metavar='FILE',
                                help='file prefix for data')
            parser.add_argument('--max-positions', default=1024, type=int,
                                help='max input length')
    
        @classmethod
        def setup_task(cls, args, **kwargs):
            # Here we can perform any setup required for the task. This may include
            # loading Dictionaries, initializing shared Embedding layers, etc.
            # In this case we'll just load the Dictionaries.
            input_vocab = Dictionary.load(os.path.join(args.data, 'dict.input.txt'))
            label_vocab = Dictionary.load(os.path.join(args.data, 'dict.label.txt'))
            print('| [input] dictionary: {} types'.format(len(input_vocab)))
            print('| [label] dictionary: {} types'.format(len(label_vocab)))
    
            return SimpleClassificationTask(args, input_vocab, label_vocab)
    
        def __init__(self, args, input_vocab, label_vocab):
            super().__init__(args)
            self.input_vocab = input_vocab
            self.label_vocab = label_vocab
    
        def load_dataset(self, split, **kwargs):
            """Load a given dataset split (e.g., train, valid, test)."""
    
            prefix = os.path.join(self.args.data, '{}.input-label'.format(split))
    
            # Read input sentences.
            sentences, lengths = [], []
            with open(prefix + '.input', encoding='utf-8') as file:
                for line in file:
                    sentence = line.strip()
    
                    # Tokenize the sentence, splitting on spaces
                    tokens = self.input_vocab.encode_line(
                        sentence, add_if_not_exist=False,
                    )
    
                    sentences.append(tokens)
                    lengths.append(tokens.numel())
    
            # Read labels.
            labels = []
            with open(prefix + '.label', encoding='utf-8') as file:
                for line in file:
                    label = line.strip()
                    labels.append(
                        # Convert label to a numeric ID.
                        torch.LongTensor([self.label_vocab.add_symbol(label)])
                    )
    
            assert len(sentences) == len(labels)
            print('| {} {} {} examples'.format(self.args.data, split, len(sentences)))
    
            # We reuse LanguagePairDataset since classification can be modeled as a
            # sequence-to-sequence task where the target sequence has length 1.
            self.datasets[split] = LanguagePairDataset(
                src=sentences,
                src_sizes=lengths,
                src_dict=self.input_vocab,
                tgt=labels,
                tgt_sizes=torch.ones(len(labels)),  # targets have length 1
                tgt_dict=self.label_vocab,
                left_pad_source=False,
                max_source_positions=self.args.max_positions,
                max_target_positions=1,
                # Since our target is a single class label, there's no need for
                # input feeding. If we set this to ``True`` then our Model's
                # ``forward()`` method would receive an additional argument called
                # *prev_output_tokens* that would contain a shifted version of the
                # target sequence.
                input_feeding=False,
            )
    
        def max_positions(self):
            """Return the max input length allowed by the task."""
            # The source should be less than *args.max_positions* and the "target"
            # has max length 1.
            return (self.args.max_positions, 1)
    
        @property
        def source_dictionary(self):
            """Return the source :class:`~fairseq.data.Dictionary`."""
            return self.input_vocab
    
        @property
        def target_dictionary(self):
            """Return the target :class:`~fairseq.data.Dictionary`."""
            return self.label_vocab
    
        # We could override this method if we wanted more control over how batches
        # are constructed, but it's not necessary for this tutorial since we can
        # reuse the batching provided by LanguagePairDataset.
        #
        # def get_batch_iterator(
        #     self, dataset, max_tokens=None, max_sentences=None, max_positions=None,
        #     ignore_invalid_inputs=False, required_batch_size_multiple=1,
        #     seed=1, num_shards=1, shard_id=0,
        # ):
        #     (...)
    
    • The two properties that must be overridden, source_dictionary and target_dictionary, simply return the Dictionary instances that fairseq already provides.
    • At training time, the task specifies how to load the data and stores it in self.datasets[split]; the chosen architecture then pulls its data from there, and everything else, such as how batches are assembled, is taken care of automatically because we reuse the batching provided by LanguagePairDataset.
    • setup_task is where the vocabularies are loaded.
    fairseq-train names-bin \
      --task simple_classification \
      --arch pytorch_tutorial_rnn \
      --optimizer adam --lr 0.001 --lr-shrink 0.5 \
      --max-tokens 1000
    
    • Write an evaluation script; --path points at the model checkpoint file. Input is read with sentence = input('\nInput: '), so data can also be piped in from standard input (see the example after the command below).
    • Use the built-in collate function from fairseq.data to assemble the input into a batch, then feed the batch to the model to get predictions.
    from fairseq import data, options, tasks, utils
    
    # Parse command-line arguments for generation
    parser = options.get_generation_parser(default_task='simple_classification')
    args = options.parse_args_and_arch(parser)
    
    # Setup task
    task = tasks.setup_task(args)
    
    # Load model
    print('| loading model from {}'.format(args.path))
    models, _model_args = utils.load_ensemble_for_inference([args.path], task)
    model = models[0]
    
    while True:
        sentence = input('\nInput: ')
    
        # Tokenize into characters
        chars = ' '.join(list(sentence.strip()))
        tokens = task.source_dictionary.encode_line(
            chars, add_if_not_exist=False,
        )
    
        # Build mini-batch to feed to the model
        batch = data.language_pair_dataset.collate(
            samples=[{'id': -1, 'source': tokens}],  # bsz = 1
            pad_idx=task.source_dictionary.pad(),
            eos_idx=task.source_dictionary.eos(),
            left_pad_source=False,
            input_feeding=False,
        )
    
        # Feed batch to the model and get predictions
        preds = model(**batch['net_input'])
    
        # Print top 3 predictions and their log-probabilities
        top_scores, top_labels = preds[0].topk(k=3)
        for score, label_idx in zip(top_scores, top_labels):
            label_name = task.target_dictionary.string([label_idx])
            print('({:.2f})\t{}'.format(score, label_name))
    
    • Command to run prediction:
      python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt
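
    • Because the script reads from standard input via input(), a name can also be piped in instead of typed interactively, for example (the input name here is only an illustration):
      echo "Satoshi" | python eval_classifier.py names-bin --path checkpoints/checkpoint_best.pt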

    Parsing the command line

    • How the fairseq-preprocess command line locates the various data files.
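
    • For example, the binarized IWSLT data used in the LSTM tutorial above is typically produced with an invocation along these lines (paths are illustrative); the --trainpref/--validpref/--testpref prefixes combined with the --source-lang/--target-lang suffixes determine which files fairseq-preprocess reads, and --destdir is where the binarized output goes:
    fairseq-preprocess --source-lang de --target-lang en \
      --trainpref iwslt14.tokenized.de-en/train \
      --validpref iwslt14.tokenized.de-en/valid \
      --testpref iwslt14.tokenized.de-en/test \
      --destdir data-bin/iwslt14.tokenized.de-en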

    Training Flow

    • Tasks store the dictionaries and provide the hooks for loading data and iterating over it during training.
    • In the training flow, task.get_batch_iterator loads the data and returns an iterator over batches, while lr_scheduler adjusts the learning rate after every update and at the end of every epoch.
    for epoch in range(num_epochs):
        itr = task.get_batch_iterator(task.dataset('train'))
        for num_updates, batch in enumerate(itr):
            task.train_step(batch, model, criterion, optimizer)
            average_and_clip_gradients()
            optimizer.step()
            lr_scheduler.step_update(num_updates)
        lr_scheduler.step(epoch)
    
    • The default task.train_step:
    def train_step(self, batch, model, criterion, optimizer):
        loss = criterion(model, batch)
        optimizer.backward(loss)
    
    • Extra modules can be loaded from a custom location with --user-dir.
    • Suppose the directory is:
    /home/user/my-module/
    └── __init__.py
    
    • and __init__.py contains:
    from fairseq.models import register_model_architecture
    from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
    
    @register_model_architecture('transformer', 'my_transformer')
    def transformer_mmt_big(args):
        transformer_vaswani_wmt_en_de_big(args)
    
    • The new architecture can then be used with fairseq-train:
    fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
    

    task

    • How the task is used: build_model, build_criterion, load_dataset, get_batch_iterator, get_loss... essentially all of the functionality goes through the task:
    # setup the task (e.g., load dictionaries)
    task = fairseq.tasks.setup_task(args)
    
    # build model and criterion
    model = task.build_model(args)
    criterion = task.build_criterion(args)
    
    # load datasets
    task.load_dataset('train')
    task.load_dataset('valid')
    
    # iterate over mini-batches of data
    batch_itr = task.get_batch_iterator(
        task.dataset('train'), max_tokens=4096,
    )
    for batch in batch_itr:
        # compute the loss
        loss, sample_size, logging_output = task.get_loss(
            model, criterion, batch,
        )
        loss.backward()
    
