Background
For business reasons I need to run word segmentation over several million short texts. The only tool at hand is jieba, so I want to parallelize the segmentation to improve throughput.
Implementation
import pandas as pd
import jieba
from multiprocessing import Pool

def segment(text):
    # Segment one string and join the tokens with '-'.
    seg_list = jieba.cut(text)
    return '-'.join(seg_list)

def parallel_segment(df):
    # Create a process pool; set the number of processes to suit your machine.
    with Pool() as pool:
        df['segmented_text'] = pool.map(segment, df['message'])
    return df

if __name__ == "__main__":
    with open("tmp.txt", "r", encoding="utf-8") as f:
        data = f.read()
    tmp = data.split("\n")
    df = pd.DataFrame(tmp)
    df.columns = ["message"]
    # Keep only messages longer than 10 characters.
    df = df[df.message.str.len() > 10]
    df_segmented = parallel_segment(df)
    df_segmented.to_pickle("result.pickle")
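One detail worth noting: each worker process loads jieba's dictionary lazily on its first call to jieba.cut, so part of the wall-clock time is per-worker warm-up rather than segmentation. Below is a minimal sketch of moving that warm-up into the pool initializer and giving pool.map an explicit chunksize; the chunksize value is only an assumption to tune, not a measured optimum.

import jieba
from multiprocessing import Pool

def segment(text):
    return '-'.join(jieba.cut(text))

def parallel_segment(df, processes=None):
    # jieba.initialize preloads the dictionary once per worker,
    # instead of paying that cost on each worker's first cut().
    with Pool(processes=processes, initializer=jieba.initialize) as pool:
        # A larger chunksize cuts down inter-process communication when
        # mapping over many small strings; 1000 is a guess to adjust.
        df['segmented_text'] = pool.map(segment, df['message'], chunksize=1000)
    return df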
Ran it on my own machine: 15,233 entries took about 4.86 s. Something feels off... it seems a bit too fast.
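One way to sanity-check the 4.86 s figure is to time a plain single-process pass over the same column and compare. The snippet below is a rough sketch; the time.perf_counter timing and the serial apply baseline are mine, not part of the original script.

import time
import pandas as pd
import jieba

def serial_segment(df):
    # Single-process baseline: same work as segment(), no Pool.
    return df['message'].apply(lambda t: '-'.join(jieba.cut(t)))

if __name__ == "__main__":
    df = pd.read_pickle("result.pickle")  # reuse the already-filtered data
    start = time.perf_counter()
    serial_segment(df)
    print(f"serial pass over {len(df)} rows: {time.perf_counter() - start:.2f}s")

If the serial pass is not several times slower than the parallel run, then much of the 4.86 s is likely process start-up and data transfer rather than segmentation itself.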
Reader comments