Keras Source Code Analysis: Tokenizer

Author: Black先森 | Published: 2019-10-10 22:29

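I really like the Keras framework. I normally use its packaged APIs, which cover almost everything I need, so I rarely have to touch the source. Recently I have become more curious about how Keras is implemented, so I am spending some time reading the source and writing up study notes.

I skimmed the Chinese and English Keras documentation alongside the source and found the docs incomplete: many interfaces implemented in the source are not covered in the documentation at all. That gave me the idea of organizing my own analysis of the source.

As the first post in this series, this one starts with the Tokenizer from the preprocessing module.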

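What is a tokenizer

A computer cannot understand the meaning of words when it processes natural language, so each word (in Chinese, either a single character or a multi-character phrase can count as a word) is usually mapped to a positive integer, and a text becomes a sequence of integers. The core job of the tokenizer is to do exactly this.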

Basic parameters

keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)
  • num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
  • filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
  • lower: boolean. Whether to convert the texts to lowercase.
  • split: str. Separator for word splitting.
  • char_level: if True, every character will be treated as a token.
  • oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls[1]

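  • num_words: the maximum number of words to keep, ranked by word frequency; the top num_words - 1 words are kept.
  • filters: the characters to filter out; by default, the common special symbols.
  • lower: whether to convert the text to lowercase.
  • split: the word separator.
  • char_level: whether to treat every character as a token; defaults to False. When processing Chinese with each character as a token, set this to True.
  • oov_token: if given, it is added to the word index and used to replace out-of-vocabulary words.
  • document_count: the number of documents. This is normally computed automatically from the texts fed in, so there is no need to pass it.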
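To make the interplay of lower, filters and split concrete, here is a minimal sketch of the word-splitting step. The name split_into_words is hypothetical and not part of Keras; the sketch assumes that every filtered character is replaced by the separator before splitting, which is my reading of what text_to_word_sequence does.

```python
# Default filter characters, copied from the Tokenizer signature above.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_into_words(text, filters=DEFAULT_FILTERS, lower=True, split=' '):
    """Hypothetical stand-in for the splitting step (an assumption, not the Keras source)."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator so it acts as a word boundary.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings produced by consecutive separators.
    return [token for token in text.split(split) if token]


print(split_into_words("Hello, world! It's Keras."))
# ['hello', 'world', "it's", 'keras']
```

The apostrophe survives because the default filters deliberately leave out the ' character.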

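Some important interfaces

Here I took screenshots straight from the Chinese Keras documentation [2]. One small correction: these are methods of an object (instance methods), not class methods.

[image: method list from the Chinese Keras documentation]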

Source code analysis

    def fit_on_texts(self, texts):
        """Updates internal vocabulary based on a list of texts.
        (Mainly the word_index and index_word attributes.)

        In the case where texts contains lists,
        we assume each entry of the lists to be a token.

        Required before using `texts_to_sequences` or `texts_to_matrix`.

        # Arguments
            texts: can be a list of strings,
                a generator of strings (for memory-efficiency),
                or a list of list of strings.
        """
        
        for text in texts:
            self.document_count += 1 # update the document count
            if self.char_level or isinstance(text, list):
                if self.lower:
                    if isinstance(text, list):
                        text = [text_elem.lower() for text_elem in text] # lowercase every token
                    else:
                        text = text.lower()
                seq = text # seq holds the token sequence; each element is a single character or word
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split) # split the text into a word sequence; this helper is analyzed separately
            # self.word_counts is an ordered dictionary used to count word frequencies
            for w in seq:
                if w in self.word_counts:
                    self.word_counts[w] += 1
                else:
                    self.word_counts[w] = 1
            for w in set(seq):
                # In how many documents each word occurs
                self.word_docs[w] += 1

        wcounts = list(self.word_counts.items())
        wcounts.sort(key=lambda x: x[1], reverse=True) # sort by word frequency, descending
        # forcing the oov_token to index 1 if it exists
        # (index 0 is usually used as the padding value)
        # check whether an out-of-vocabulary token was specified
        if self.oov_token is None:
            sorted_voc = []
        else:
            sorted_voc = [self.oov_token]
        sorted_voc.extend(wc[0] for wc in wcounts)

        # note that index 0 is reserved, never assigned to an existing word
        # update word_index
        self.word_index = dict(
            list(zip(sorted_voc, list(range(1, len(sorted_voc) + 1)))))
        # update index_word
        self.index_word = dict((c, w) for w, c in self.word_index.items())

        for w, c in list(self.word_docs.items()):
            self.index_docs[self.word_index[w]] = c

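Summary of the implementation: the input texts are first split into words, and the frequency of each word is counted and stored in an ordered dictionary. The dictionary items are then converted to a list and sorted by frequency in descending order. From this sorted list we get word_index and index_word. The later conversions, texts_to_sequences (text to integer sequence) and sequences_to_texts (integer sequence back to text), both rely on these two lookup tables.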
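The counting and indexing steps summarized here can be sketched in plain Python. The helper build_word_index is a hypothetical name of mine, and it takes texts that are already split into tokens; it illustrates the idea rather than reproducing the Keras code.

```python
from collections import OrderedDict


def build_word_index(tokenized_texts, oov_token=None):
    """Hypothetical helper mirroring the counting and indexing steps of fit_on_texts."""
    word_counts = OrderedDict()
    for tokens in tokenized_texts:
        for w in tokens:
            word_counts[w] = word_counts.get(w, 0) + 1

    # Sort by frequency, descending. Python's sort is stable, so ties keep first-seen order.
    wcounts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

    # The OOV token, if any, takes index 1; index 0 is never assigned.
    sorted_voc = [] if oov_token is None else [oov_token]
    sorted_voc.extend(w for w, _ in wcounts)

    word_index = {w: i for i, w in enumerate(sorted_voc, start=1)}
    index_word = {i: w for w, i in word_index.items()}
    return word_index, index_word


word_index, index_word = build_word_index(
    [["the", "cat", "sat"], ["the", "dog"]], oov_token="<OOV>")
print(word_index)
# {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'dog': 5}
```

Because the OOV token is placed before every real word, specifying it shifts all word indices up by one.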

    def texts_to_sequences_generator(self, texts):
        """Transforms each text in `texts` to a sequence of integers.

        Each item in texts can also be a list,
        in which case we assume each item of that list to be a token.

        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.

        # Arguments
            texts: A list of texts (strings).

        # Yields
            Yields individual sequences.
        """
        num_words = self.num_words # cap on the most frequent words to keep
        oov_token_index = self.word_index.get(self.oov_token) # look up the index of oov_token
        for text in texts:
            if self.char_level or isinstance(text, list):
                if self.lower:
                    if isinstance(text, list):
                        text = [text_elem.lower() for text_elem in text]
                    else:
                        text = text.lower()
                seq = text
            else:
                seq = text_to_word_sequence(text,
                                            self.filters,
                                            self.lower,
                                            self.split)
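            vect = [] # holds the resulting integer sequence
            for w in seq:
                # note that word_index is ordered by descending word frequency
                i = self.word_index.get(w) # look up the word index
                if i is not None: # the word is in the vocabulary
                    # num_words is set and the word index is >= num_words
                    if num_words and i >= num_words:
                        if oov_token_index is not None: # an oov_token index exists
                            vect.append(oov_token_index) # treat this word as the oov_token
                    else:
                        vect.append(i) # num_words is not set, or i < num_words: keep the index
                elif self.oov_token is not None:
                    vect.append(oov_token_index)
            yield vect # yield one sequence per text
            # Caveat: when no oov_token is specified, a word that is unknown or whose index is >= num_words is silently dropped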

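Summary of the implementation: look up the word index, then decide whether it qualifies for the output.

  • If no index is found for the word, the code tries to substitute oov_token; if oov_token was not specified either, the word is simply dropped.
  • If an index is found, the code checks whether num_words is set and whether the index is greater than or equal to num_words; if so, the word is handled like an out-of-vocabulary word.

texts_to_sequences calls this generator directly under the hood.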
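The branching just described can be condensed into a few lines. The helper encode below is hypothetical and folds the two out-of-vocabulary branches of the generator into one, so treat it as an illustration of the rule rather than a copy of the source; the vocabularies are made up for the example.

```python
def encode(tokens, word_index, num_words=None, oov_token=None):
    """Hypothetical helper mirroring the per-word rule of texts_to_sequences_generator."""
    oov_index = word_index.get(oov_token)
    vect = []
    for w in tokens:
        i = word_index.get(w)
        if i is not None and not (num_words and i >= num_words):
            vect.append(i)          # known word inside the kept range
        elif oov_index is not None:
            vect.append(oov_index)  # unknown or out-of-range word becomes OOV
        # otherwise the word is silently dropped
    return vect


plain = {"the": 1, "cat": 2, "sat": 3, "dog": 4}
with_oov = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4, "dog": 5}

print(encode(["the", "cat", "sat", "flew"], plain, num_words=3))
# [1, 2]
print(encode(["the", "cat", "flew"], with_oov, num_words=3, oov_token="<OOV>"))
# [2, 1, 1]
```

With num_words=3 only indices 1 and 2 survive, which is where the "num_words - 1" in the documentation comes from. When an oov_token is set it occupies index 1, so only one real word is left in this example.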

    def sequences_to_texts_generator(self, sequences):
        """Transforms each sequence in `sequences` to a list of texts(strings).

        Each sequence has to a list of integers.
        In other words, sequences should be a list of sequences

        Only top `num_words-1` most frequent words will be taken into account.
        Only words known by the tokenizer will be taken into account.

        # Arguments
            sequences: A list of sequences.

        # Yields
            Yields individual texts.
        """
        num_words = self.num_words
        oov_token_index = self.word_index.get(self.oov_token)
        for seq in sequences:
            vect = []
            for num in seq:
                word = self.index_word.get(num) # look up the word for this index
                if word is not None: # the index maps to a word
                    if num_words and num >= num_words: # num_words is set and the index is >= num_words
                        if oov_token_index is not None: # an oov_token was specified
                            vect.append(self.index_word[oov_token_index]) # emit the oov_token instead
                    else:
                        vect.append(word) # num_words is not set, or num < num_words
                elif self.oov_token is not None: # no word for this index, but an oov_token is set
                    vect.append(self.index_word[oov_token_index]) # this also becomes the oov_token
            vect = ' '.join(vect) # join the words into a single string
            yield vect

Interface analysis: the logic is mostly clear from the comments. sequences_to_texts calls this generator directly.
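A short sketch of the reverse direction shows that decoding is lossy. The helper decode is a hypothetical name, the index_word tables are invented for the example, and the two out-of-vocabulary branches are again folded into one.

```python
def decode(sequence, index_word, num_words=None, oov_index=None):
    """Hypothetical helper mirroring the per-index rule of sequences_to_texts_generator."""
    words = []
    for num in sequence:
        word = index_word.get(num)
        if word is not None and not (num_words and num >= num_words):
            words.append(word)
        elif oov_index is not None:
            words.append(index_word[oov_index])
        # otherwise the index is silently dropped
    # Words are always joined with a single space.
    return ' '.join(words)


print(decode([2, 3, 4, 99], {1: "<OOV>", 2: "the", 3: "cat", 4: "sat"},
             num_words=4, oov_index=1))
# the cat <OOV> <OOV>
print(decode([2, 3, 99], {1: "the", 2: "cat", 3: "sat"}))
# cat sat
```

Filtered punctuation, letter case and dropped words cannot be recovered, so the decoded string generally differs from the original text.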

    def get_config(self):
        '''Returns the tokenizer configuration as Python dictionary.
        The word count dictionaries used by the tokenizer get serialized
        into plain JSON, so that the configuration can be read by other
        projects.

        # Returns
            A Python dictionary with the tokenizer configuration.
        '''
        json_word_counts = json.dumps(self.word_counts)
        json_word_docs = json.dumps(self.word_docs)
        json_index_docs = json.dumps(self.index_docs)
        json_word_index = json.dumps(self.word_index)
        json_index_word = json.dumps(self.index_word)

        return {
            'num_words': self.num_words,
            'filters': self.filters,
            'lower': self.lower,
            'split': self.split,
            'char_level': self.char_level,
            'oov_token': self.oov_token,
            'document_count': self.document_count,
            'word_counts': json_word_counts,
            'word_docs': json_word_docs,
            'index_docs': json_index_docs,
            'index_word': json_index_word,
            'word_index': json_word_index
        }

    def to_json(self, **kwargs):
        """Returns a JSON string containing the tokenizer configuration.
        To load a tokenizer from a JSON string, use
        `keras.preprocessing.text.tokenizer_from_json(json_string)`.

        # Arguments
            **kwargs: Additional keyword arguments
                to be passed to `json.dumps()`.

        # Returns
            A JSON string containing the tokenizer configuration.
        """
        config = self.get_config()
        tokenizer_config = {
            'class_name': self.__class__.__name__,
            'config': config
        }
        return json.dumps(tokenizer_config, **kwargs)

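Interface analysis: to_json serializes the tokenizer object into a JSON string that can then be stored. Once it is stored, there has to be an interface that deserializes it back into a tokenizer; that interface is tokenizer_from_json.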

def tokenizer_from_json(json_string):
    """Parses a JSON tokenizer configuration file and returns a
    tokenizer instance.

    # Arguments
        json_string: JSON string encoding a tokenizer configuration.

    # Returns
        A Keras Tokenizer instance
    """
    tokenizer_config = json.loads(json_string)
    config = tokenizer_config.get('config')

    word_counts = json.loads(config.pop('word_counts'))
    word_docs = json.loads(config.pop('word_docs'))
    index_docs = json.loads(config.pop('index_docs'))
    # Integer indexing gets converted to strings with json.dumps()
    index_docs = {int(k): v for k, v in index_docs.items()}
    index_word = json.loads(config.pop('index_word'))
    index_word = {int(k): v for k, v in index_word.items()}
    word_index = json.loads(config.pop('word_index'))

    tokenizer = Tokenizer(**config)
    tokenizer.word_counts = word_counts
    tokenizer.word_docs = word_docs
    tokenizer.index_docs = index_docs
    tokenizer.word_index = word_index
    tokenizer.index_word = index_word

    return tokenizer
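The int(k) conversions in tokenizer_from_json are needed because JSON object keys are always strings. A small standalone demonstration with the standard json module, using a made-up index_word table:

```python
import json

index_word = {1: "<OOV>", 2: "the", 3: "cat"}

dumped = json.dumps(index_word)   # integer keys are written as strings
loaded = json.loads(dumped)       # ...and come back as strings
restored = {int(k): v for k, v in loaded.items()}

print(dumped)                     # {"1": "<OOV>", "2": "the", "3": "cat"}
print(loaded == index_word)       # False
print(restored == index_word)     # True
```

word_index and word_counts need no such step because their keys are already strings.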

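Summary

This post walked through the more important parameters, attributes, and instance methods of the Keras Tokenizer class. The tokenizer mainly converts text into sequences of word indices, and it also provides the reverse interface from sequences back to text. The source is clear and concise, and the functionality is largely complete. If you need some customized behavior, it is easy to subclass it and add a few methods. For example, I want to drop low-frequency words instead of setting a number of words to keep. With a large corpus it is hard to decide how many words to keep (is it 20,000 or 15,000?), whereas low-frequency words are easy to define.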
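One possible way to get such a low-frequency cutoff without subclassing, offered as a sketch under assumptions: since word indices are assigned in descending frequency order and the generator reads self.num_words at call time, a value for num_words could be derived from word_counts after fit_on_texts and assigned afterwards. The helper name num_words_for_min_count and the sample counts are hypothetical.

```python
def num_words_for_min_count(word_counts, min_count, has_oov_token=False):
    """Hypothetical helper: choose num_words so only words seen >= min_count times are kept."""
    kept = sum(1 for count in word_counts.values() if count >= min_count)
    # Word indices start at 1 (or at 2 when an OOV token occupies index 1),
    # and only indices strictly below num_words survive.
    return kept + 1 + (1 if has_oov_token else 0)


counts = {"the": 5, "cat": 3, "sat": 1, "dog": 1}
print(num_words_for_min_count(counts, min_count=2))                      # 3
print(num_words_for_min_count(counts, min_count=2, has_oov_token=True))  # 4
```

This relies on the frequent words occupying the lowest indices, which is what the descending sort in fit_on_texts provides.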


  1. https://keras.io/preprocessing/text/

  2. https://keras-cn-docs.readthedocs.io/zh_CN/latest/preprocessing/text/
