Chinese Sentence Segmentation

Author: elephantnose | Published: 2019-07-02 15:53

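In NLP tasks, preprocessing sometimes requires splitting text into sentences. I looked online for ready-made snippets, but every one I tried ran into problems in practice. The most common problem was that double quotation marks were not kept with the sentence they belong to: an opening “ could end up attached to the previous sentence, while a closing ” was pushed into the next one. So I wrote my own code to handle cases like these. It works reasonably well on some test texts. The code is listed in full below: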
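To make the quotation-mark problem concrete, here is a minimal sketch of a naive splitter that breaks after every sentence-ending mark and ignores quotes. The function name `naive_split` and its `terminators` parameter are hypothetical and are not part of the author's code.

```python
def naive_split(text, terminators="。？！"):
    """Split after every terminator, with no awareness of quotation marks."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in terminators:
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without a terminator
    return sentences


# The closing quote is detached from its sentence and starts the next one.
print(naive_split("“你好。”他说。"))  # ['“你好。', '”他说。']
```

The second piece begins with a stray ”, which is exactly the kind of result the quote-aware rules in the author's listing are meant to avoid.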

import re


class SentenceSplitter(object):
    def __init__(self):
        self.end_operator = ["。", "？", "……", "…", "~", "！"]

    def cut(self, text):
        sentence_set = [sentence.strip() for sentence in text.split("\n") if sentence.strip()]
        res = []

        for sentence in sentence_set:
            try:
                for s in self.rules(sentence, []):
                    res.append(s)
            except Exception as e:
                res.append(sentence)
                print(e)
                print(sentence)

        return res

    def rules(self, sentence, cut_list):
        self.text = sentence
        if not self.text:
            return cut_list
        
        # The text opens with a quotation: split around the quoted part.
        if self.text[0] == "“":
            couple_index = self.text.find("”")
            # A matching closing quote was found.
            if couple_index != -1:
                if self.text[couple_index-1] in self.end_operator:
                    cut_list.append(self.text[:couple_index+1])
                    # Continue splitting the rest of the text.
                    text = self.text[couple_index+1:]
                    return self.rules(text, cut_list)
                else:
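                    # Search for the first terminator after the closing quote, so the quote itself is never split.
                    end_operator = re.compile("|".join(self.end_operator)).search(self.text, couple_index)
                    if end_operator:
                        end_operator_index = end_operator.start()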
                        if "“" not in self.text[couple_index:end_operator_index]:
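                            # Cut at the end of the matched terminator so a two-character "……" stays intact.
                            cut_list.append(self.text[:end_operator.end()])
                            # Continue splitting the rest of the text.
                            text = self.text[end_operator.end():]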
                            return self.rules(text, cut_list)
                        else:
                            couple_index_2 = self.text[couple_index+1:].find("”")
                            if couple_index_2 != -1:
                                couple_index_2 = couple_index_2 + couple_index+1
                                if couple_index_2 == len(self.text) - 1:
                                    cut_list.append(self.text)
                                    return cut_list
                                elif self.text[couple_index_2 - 1] in self.end_operator:
                                    cut_list.append(self.text[:couple_index_2 + 1])
                                    # Continue splitting the rest of the text.
                                    text = self.text[couple_index_2 + 1:]
                                    return self.rules(text, cut_list)
                                elif self.text[couple_index_2 + 1] in self.end_operator:
                                    cut_list.append(self.text[:couple_index_2 + 2])
                                    # Continue splitting the rest of the text.
                                    text = self.text[couple_index_2 + 2:]
                                    return self.rules(text, cut_list)
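                                else:
                                    cut_list.append(self.text)
                                    return cut_list
                            else:
                                # The second quote is never closed: return the text unsplit
                                # instead of falling through and returning None.
                                cut_list.append(self.text)
                                return cut_list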
                    else:
                        cut_list.append(self.text)
                        return cut_list
            # Malformed punctuation (no closing quote): return the text unsplit.
            else:
                cut_list.append(self.text)
                return cut_list
                
        else:
            end_operator = re.search("|".join(self.end_operator), self.text)
            if end_operator:
                end_operator_index = self.text.index(end_operator.group())
                # Pattern: xxxxxxx。 (the terminator is not inside an open quote)
                if "“" not in self.text[:end_operator_index] or \
                        ("“" in self.text[:end_operator_index] and "”" in self.text[:end_operator_index]):
                    if end_operator.group() == "……":
                        end_operator_index += 1

                    if "”" in self.text[end_operator_index+1:]:
                        couple_index = self.text[end_operator_index+1:].find("”") + end_operator_index + 1
                        if couple_index == len(self.text) - 1:
                            cut_list.append(self.text)
                            return cut_list
                        elif self.text[couple_index-1] in self.end_operator:
                            cut_list.append(self.text[:couple_index+1])
                            text = self.text[couple_index+1:]
                            return self.rules(text, cut_list)
                        else:
                            cut_list.append(self.text[:end_operator_index + 1])
                            # Continue splitting the rest of the text.
                            text = self.text[end_operator_index + 1:]
                            return self.rules(text, cut_list)
                    else:
                        cut_list.append(self.text[:end_operator_index+1])
                        # Continue splitting the rest of the text.
                        text = self.text[end_operator_index+1:]
                        return self.rules(text, cut_list)
                # Pattern: “xxxxxx。xxxx” (the terminator is inside an open quote)
                else:
                    couple_index = self.text.find("”")
                    # If the closing quote ends the text, return it whole without further splitting.
                    if couple_index == len(self.text) - 1:
                        cut_list.append(self.text)
                        return cut_list
                    elif couple_index != -1:
                        if self.text[couple_index-1] in self.end_operator:
                            cut_list.append(self.text[:couple_index+1])
                            # Continue splitting the rest of the text.
                            text = self.text[couple_index+1:]
                            return self.rules(text, cut_list)
                        elif self.text[couple_index+1] in self.end_operator:
                            cut_list.append(self.text[:couple_index+2])
                            # Continue splitting the rest of the text.
                            text = self.text[couple_index+2:]
                            return self.rules(text, cut_list)
                        else:
                            cut_list.append(self.text)
                            return cut_list
                    else:
                        cut_list.append(self.text)
                        return cut_list
            else:
                cut_list.append(self.text)
                return cut_list


if __name__ == "__main__":
    ss = SentenceSplitter()
    cut_res = ss.cut("“我爷爷蔡开铭1933年参加红五军团34师，在湘江战役中英勇牺牲。”来自福建省长汀县的红军后代蔡金旺眼含泪花，“今天，我带来了一顶红军斗笠。这顶斗笠的样式是1932年冬天，毛泽东同志在长汀期间亲手改的，当时他把尖顶宽边的粤军斗笠样式，改成了现在的平顶缠边样式，行军路上不磨衣，雨天遮雨、晴天当扇子，休息可当枕头、当坐垫。”")
    for s in cut_res:
        print(s)

The output is:

“我爷爷蔡开铭1933年参加红五军团34师，在湘江战役中英勇牺牲。”
来自福建省长汀县的红军后代蔡金旺眼含泪花，“今天，我带来了一顶红军斗笠。这顶斗笠的样式是1932年冬天，毛泽东同志在长汀期间亲手改的，当时他把尖顶宽边的粤军斗笠样式，改成了现在的平顶缠边样式，行军路上不磨衣，雨天遮雨、晴天当扇子，休息可当枕头、当坐垫。”
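The two-sentence result above is consistent with a simple rule of thumb: never break inside “…”, and treat a closing quote that directly follows a terminator as the end of a sentence. The sketch below illustrates that rule in compact form. It is a simplified approximation and not the author's implementation; the function name `split_outside_quotes` and its `terminators` parameter are hypothetical, and its results may differ from `SentenceSplitter` on other inputs.

```python
def split_outside_quotes(text, terminators="。？！…~"):
    """Split text into sentences without breaking inside “…” quotes."""
    sentences, start, depth = [], 0, 0
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "“":
            depth += 1
        elif ch == "”":
            depth = max(depth - 1, 0)
            # A closing quote directly after a terminator ends the sentence.
            if depth == 0 and i > 0 and text[i - 1] in terminators:
                sentences.append(text[start:i + 1])
                start = i + 1
        elif ch in terminators and depth == 0:
            # Wait if another terminator or a closing quote follows (e.g. "……").
            if nxt and (nxt in terminators or nxt == "”"):
                continue
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without a terminator
    return sentences


for s in split_outside_quotes("“你好。”他说，“再见。明天见。”"):
    print(s)
```

Tracking quote depth keeps the two sentences inside the second quotation together, which matches how the sample text above is kept as a single second sentence.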

To use it, instantiate a SentenceSplitter object (ss in the example) and call its cut method directly.

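Supports splitting an entire document into sentences.
Only supports text that uses Chinese punctuation correctly; if the punctuation is misused, the split may be inaccurate.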
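Because the splitter assumes well-formed Chinese punctuation, it can help to check a text before splitting it. The following is a minimal sketch of such a check. The function `check_punctuation` and its warning strings are hypothetical, and the set of ASCII characters it flags (`"`, `?`, `!`) is an assumption about what counts as misuse, not something the original code defines.

```python
def check_punctuation(text):
    """Return warnings about punctuation that a Chinese splitter may mishandle."""
    warnings = []

    # Check that every “ has a matching ” and that none closes before it opens.
    depth = 0
    for ch in text:
        if ch == "“":
            depth += 1
        elif ch == "”":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        warnings.append("unbalanced Chinese quotation marks")

    # Assumption: ASCII quotes and terminators indicate non-Chinese punctuation.
    if any(ch in text for ch in '"?!'):
        warnings.append("ASCII punctuation found")

    return warnings


print(check_punctuation("“你好。”他说。"))  # []
```

An empty list suggests the text meets the splitter's assumptions; any warning suggests the resulting split should be reviewed by hand.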

Article link: https://www.haomeiwen.com/subject/shazcctx.html