Preprocessing

Author: Jakai | Published: 2017-08-19 09:02 | Views: 0
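    import re
    import nltk  # needs the stopwords corpus: nltk.download("stopwords")
    from bs4 import BeautifulSoup

    stopwords = nltk.corpus.stopwords.words("english")
    eng_stopwords = set(stopwords)  # a set gives fast membership tests

    def clean_text(text):
        # Strip HTML markup, keeping only the visible text
        text = BeautifulSoup(text, 'html.parser').get_text()
        # Replace every non-letter character with a space
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        # Lowercase, split on whitespace, and drop English stop words
        words = text.lower().split()
        words = [w for w in words if w not in eng_stopwords]
        return ' '.join(words)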
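
To see what these steps do to a concrete input, here is a minimal, dependency-free sketch of the same pipeline. It is not the original function: `clean_text_demo` and `DEMO_STOPWORDS` are hypothetical names, the tiny stop-word set stands in for NLTK's English list, and a simple regex stands in for BeautifulSoup's tag stripping (which is far more robust on real HTML).

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (an assumption, far from complete).
DEMO_STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "this", "it"}

def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    # Crude tag removal; the original code uses BeautifulSoup for this step.
    text = re.sub(r"<[^>]+>", " ", text)
    # Replace every non-letter character (digits, punctuation) with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, split on whitespace, and drop stop words.
    words = [w for w in text.lower().split() if w not in stopwords]
    return " ".join(words)

print(clean_text_demo("<p>This movie is GREAT, 10/10!</p>"))  # prints: movie great
```

Note that digits and punctuation are discarded along with the markup, so a rating such as "10/10" leaves no trace in the cleaned text.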
    

