NLTK: Text Wrangling and Cleaning

Author: writ | Published: 2019-04-28 22:00

    Text cleaning steps

    [Image: screenshot taken 2019-04-28 21:32:08]

    Sentence splitter

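    from nltk.tokenize import sent_tokenize
    import nltk.tokenize.punkt
    # Requires the Punkt models: nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
    inputstring = "This is an example sentence. The splitter breaks text at sentence boundaries."
    splitlist = sent_tokenize(inputstring)  # list of sentence strings
    # Instantiate the class (note the parentheses) to get a tokenizer object
    tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()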
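
As a rough illustration of how a `PunktSentenceTokenizer` instance is used, the sketch below builds an untrained one and splits a short sample. The sample text is made up for this example. An untrained instance needs no model download, but it has learned no abbreviations, so strings such as "Dr." may be treated as sentence ends; the constructor also accepts training text if you want it to learn from your own corpus.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# An untrained instance: no model download, but also no learned abbreviations.
tokenizer = PunktSentenceTokenizer()

# Hypothetical sample text, chosen only for illustration.
text = "NLTK splits text into sentences. Each sentence can then be tokenized. Is that all?"

sentences = tokenizer.tokenize(text)          # list of sentence strings
spans = list(tokenizer.span_tokenize(text))   # (start, end) character offsets

for (start, end), sentence in zip(spans, sentences):
    print(start, end, sentence)
```

`span_tokenize` is handy when you need to map each sentence back to its position in the original string.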
    

    Tokenization

    from nltk.tokenize import word_tokenize
    word_tokenize(s)  # s is the input string; returns a list of tokens
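
NLTK also ships tokenizers that need no downloaded models. The sketch below, using a made-up sample string, compares two of them: `regexp_tokenize` with a pattern that keeps only word characters, and `wordpunct_tokenize`, which keeps punctuation as separate tokens.

```python
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize

# Hypothetical sample string, chosen only for illustration.
s = "Hi Everyone! hola gr8"

# Keep runs of word characters only; punctuation is dropped.
words_only = regexp_tokenize(s, pattern=r"\w+")

# Split into alphanumeric runs and punctuation runs, keeping both.
with_punct = wordpunct_tokenize(s)

print(words_only)  # ['Hi', 'Everyone', 'hola', 'gr8']
print(with_punct)  # ['Hi', 'Everyone', '!', 'hola', 'gr8']
```

Which one fits depends on whether punctuation carries meaning for the downstream task.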
    

    Stemming

    from nltk.stem import PorterStemmer  # Porter stemmer
    from nltk.stem.lancaster import LancasterStemmer  # Lancaster stemmer
    from nltk.stem.snowball import SnowballStemmer  # Snowball stemmer (the module name is lowercase)
    pst = PorterStemmer()
    lst = LancasterStemmer()
    lst.stem("eating")
    # 'eat'
    pst.stem("eating")
    # 'eat'
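
The Snowball stemmer is imported above but not used. As a small sketch, the code below runs the three stemmers side by side on a few sample words (picked arbitrarily for this example) and shows that `SnowballStemmer` takes a language name, unlike the other two.

```python
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),  # the language is a required argument
}

# Arbitrary sample words; the stemmers do not always agree on the result.
words = ["running", "eating", "generously"]
for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in words])

# Snowball supports several languages; the full list is in SnowballStemmer.languages.
german = SnowballStemmer("german")
german_stem = german.stem("Autobahnen")
print(german_stem)  # autobahn
```

Lancaster is generally the most aggressive of the three, so it is worth comparing outputs on your own vocabulary before picking one.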
    

    Lemmatization

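    from nltk.stem import WordNetLemmatizer
    # Requires the WordNet corpus: nltk.download('wordnet')
    wlem = WordNetLemmatizer()
    # The default part of speech is noun, which leaves "ate" unchanged; pass pos="v" for verbs
    wlem.lemmatize("ate", pos="v")
    # 'eat'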
    

    Stop word removal

    from nltk.corpus import stopwords
    # Requires the stop word lists: nltk.download('stopwords')
    stoplist = stopwords.words('english')
    text = "this is just a test"
    cleanwordlist = [word for word in text.split() if word not in stoplist]
    # ['test']
    

    Rare word removal

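    import nltk
    freq_dist = nltk.FreqDist(token)  # token: list of word tokens
    # FreqDist keys are not sorted by frequency in NLTK 3, so take the tail of most_common()
    rarewords = {word for word, _ in freq_dist.most_common()[-50:]}
    chuliwords = [word for word in token if word not in rarewords]  # tokens with rare words removed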
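
Cutting a fixed number of words from the tail can remove too much on a small corpus. One alternative, sketched here as an assumption rather than the article's method, is to drop every token whose count falls below a threshold. The helper name `remove_rare_words`, the `min_count` parameter, and the sample tokens are all hypothetical.

```python
from nltk import FreqDist


def remove_rare_words(tokens, min_count=2):
    """Drop tokens that occur fewer than min_count times in the list."""
    freq_dist = FreqDist(tokens)
    rare = {word for word, count in freq_dist.items() if count < min_count}
    return [word for word in tokens if word not in rare]


# Hypothetical token list, chosen only for illustration.
tokens = "the cat sat on the mat the cat slept".split()

filtered = remove_rare_words(tokens)
print(filtered)  # ['the', 'cat', 'the', 'the', 'cat']

# FreqDist.hapaxes() lists the words that occur exactly once.
once_only = FreqDist(tokens).hapaxes()
print(sorted(once_only))  # ['mat', 'on', 'sat', 'slept']
```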
    

    Spelling correction

    from nltk.metrics import edit_distance
    edit_distance("rain", "shine")  # 3 (Levenshtein distance between the two strings)
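
Edit distance on its own only measures how far apart two strings are. A minimal sketch of turning it into a corrector is to pick the closest entry from a known vocabulary. The `suggest` helper, the `max_distance` cutoff, and the tiny vocabulary below are assumptions made for illustration; a real corrector would use a much larger word list and word frequencies.

```python
from nltk.metrics import edit_distance


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary entry, or the word itself if none is close enough."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical vocabulary, chosen only for illustration.
vocabulary = ["language", "token", "stemming", "sentence"]

print(suggest("langage", vocabulary))  # language
print(suggest("xyz", vocabulary))      # xyz (nothing within max_distance)

# By default a swapped pair of letters costs two edits; transpositions=True counts it as one.
print(edit_distance("teh", "the"))                       # 2
print(edit_distance("teh", "the", transpositions=True))  # 1
```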
    

          Article link: https://www.haomeiwen.com/subject/gzhsnqtx.html