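Generating bigrams (word pairs, i.e. combinations of two adjacent words): in the nltk package, nltk.ngrams(tokens, 2), or the shortcut nltk.bigrams(tokens), splits a token list into bigrams. My current understanding is that n-grams let you rejoin words that tokenization wrongly split apart; for example, Kunming University of Science and Technology (a 6-gram) should be kept together as one unit.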
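As a quick illustration of the paragraph above, here is a minimal sketch that builds bigrams and a 6-gram from the example phrase. The sample sentence and the variable names are my own assumptions for illustration; the tokens come from a plain split() so no NLTK data download is needed.

```python
from nltk.util import bigrams, ngrams

# Hypothetical sample phrase, tokenized with a plain split()
tokens = "Kunming University of Science and Technology".split()

# bigrams() and ngrams() return generators, so wrap them in list()
bigram_list = list(bigrams(tokens))
same_via_ngrams = list(ngrams(tokens, 2))  # ngrams(seq, 2) is equivalent to bigrams(seq)
six_gram = list(ngrams(tokens, 6))         # the whole phrase as a single 6-gram

print(bigram_list)
print(six_gram)
```

Note that these functions only slide a window over the tokens; deciding which n-grams really belong together as one unit is a separate step.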
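Generating random text
Building on conditional probability: split all the words into bigrams, build a conditional frequency distribution from them, then repeatedly take the most frequent successor of the current word and output 5 consecutive words as the "random" text.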
import nltk
from nltk.corpus import stopwords
# The string below serves as the training data
text = '' \
'On May 25, the “Experience China” social practice named "Artificial Intelligence +X" or the Opening Ceremony of Social Practice and Cultural Experience Base for Overseas Students Supported by Chinese Government Scholarship, hosted by China Scholarship Council and undertaken by Kunming University of Science and Technology, was held in Kunming. KUST invited 100 overseas student representatives funded by Chinese government scholarship from countries including Thailand, Laos, Vietnam, Pakistan, Zambia, Madagascar to participate in the event which had received strong support from Yunnan Guorong Zhichuang Artificial Intelligence Industrial Park, Yunnan Langyi Network Technology Co., Ltd and Yunnan Nengtou Weishi Technology Co., Ltd.'\
'At the ceremony, the head of the School of International Education of KUST and the chairman of Guorong Zhichuang Artificial Intelligence Industrial Park jointly opened the ceremony of the social practice and cultural experience base for overseas students funded by Chinese Government Scholarship.'\
'At the ceremony, the overseas students experienced artificial intelligence, face recognition, AR and VR systems and were given details about the 115kV power grid modernization project in Vientiane, capital of Laos and the connection project between Northern Kachin State and 230KV State Grid in Myanmar. They also touched upon the information about the design, manufacture, transportation and service of the ancillary control system of PAKE hydropower station in Vietnam and how the industrial automation control systems have been applied in other countries. To break the ice, the overseas students played a game called "Drawing Something", through which they showed their fluent Chinese and deep understanding of the Chinese culture. This game has also helped them to further understand and appreciate the Chinese characters, objects and architecture from the perspective of Chinese aesthetics.'\
'This social practice truly impressed the overseas students with the remarkable achievements China has made in the development of high-tech enterprises and technological innovation. It also deepened the communication and contact between KUST and enterprises and offered more opportunities for overseas students to complete placements in enterprises and for schools and enterprises to cooperate. Omega, a Nigeria student from KUST, said that this is his fourth year in China and that after graduation, he wishes to continue to stay in China to do what he can for the Belt and Road Initiative.'\
'The "Experience China" social practice and cultural experience program is a series of activities organized by the China Scholarship Council since 2015 with the aim to enhance the Chinese government scholarship students’ knowledge and understanding about Chinese contemporary development and its culture while building a sense of identity. This event helped the overseas students in KUST further broaden their horizons by learning about the latest achievements of Chinese high-tech enterprises and their management methods and corporate culture. They also had a chance to admire China’s natural and human landscapes.'\
'By integrating the resources of this kind across the country, the "Social Practice and Cultural Experience Base for Overseas Students of Chinese Government Scholarship " is to select and establish a number of state-level educational platforms as social practice and cultural experience bases for overseas students in China. By attending various activities including visits, experiences, lectures, academic exchanges, cultural exhibitions and social practices organized by the bases, those students funded by the Chinese government scholarship are expected to learn about China\'s national conditions and cultures, promote their friendship with local people and make contributions to the national education and foreign affairs.'\
'Translated by:LUO Man, Faculty of Foreign Languages and Cultures'\
'Edited by: LI Junrong, Faculty of Foreign Languages and Cultures'\
'Source: School of International Education'\
'Issued by: Division of Overseas Cooperation (English)'\
'Edited by: KUST News Center' \
''
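# Get the English stop words (needs the NLTK "stopwords" corpus: nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
# Tokenize the text
tokens = nltk.tokenize.regexp_tokenize(text, r'\w+')
# Remove stop words; compare in lowercase because the stop-word list is lowercase
tokens = [word for word in tokens if word.lower() not in stop_words]
# Pair the tokens up as bigrams
word_next = nltk.bigrams(tokens)
# Build the conditional frequency distribution: first word -> counts of the words that follow it
cfd = nltk.ConditionalFreqDist(word_next)

def getNext(word):
    '''
    :param word: a word
    :return: the word that most often follows it, or None if there is none
    '''
    try:
        return cfd[str(word)].max()
    except ValueError:  # max() raises ValueError when the distribution is empty
        return None

word = input("Enter a word and the program will predict the next one: ")
# Look up at most 5 words
for i in range(5):
    nxt = getNext(word)
    if nxt is None:  # no known successor, so stop
        break
    print(nxt, end=" ")
    word = nxt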
The result is an amusing piece of text.
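One caveat: always taking the most frequent successor is deterministic, so the same starting word always gives the same output. A possible variation, sketched here under my own assumptions (the function name generate, the toy sentence and the seed parameter are all hypothetical), is to sample the next word at random, weighted by how often it follows the current one.

```python
import random
import nltk

def generate(cfd, start, length=5, seed=None):
    """Sample up to `length` words, each drawn in proportion to its bigram count."""
    rng = random.Random(seed)
    word, out = start, []
    for _ in range(length):
        if word not in cfd:  # this word never appeared as the first half of a bigram
            break
        dist = cfd[word]
        candidates = list(dist.keys())
        weights = [dist[w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(word)
    return out

# Toy data for illustration only
toy_tokens = "the cat sat on the mat and the cat ran".split()
toy_cfd = nltk.ConditionalFreqDist(nltk.bigrams(toy_tokens))
print(" ".join(generate(toy_cfd, "the", seed=0)))
```

Passing a seed makes a run repeatable; leaving it as None gives a different text each time.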
Some operations on ConditionalFreqDist objects:
[Figure: cfd operations]
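For reference, a few of the commonly used operations can be tried on a toy example. This is only a sketch on a made-up sentence of mine, not a complete list of what the class offers.

```python
import nltk

# Toy data for illustration only
toy_tokens = "the cat sat on the mat and the cat ran".split()
cfd = nltk.ConditionalFreqDist(nltk.bigrams(toy_tokens))

conditions = sorted(cfd.conditions())  # every word that appears as the first half of a bigram
after_the = cfd["the"]                 # FreqDist of the words that follow "the"
count = cfd["the"]["cat"]              # how many times "cat" follows "the"
most_likely = cfd["the"].max()         # the most frequent successor of "the"
total = cfd.N()                        # total number of bigrams counted

print(conditions)
print(dict(after_the), count, most_likely, total)

# Print a small table of counts for the chosen conditions and samples
cfd.tabulate(conditions=["the"], samples=["cat", "mat"])
```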