Abstract
With the development of Internet technology, the explosive growth in the volume of text has made text data harder to process, and traditional text clustering and keyword extraction techniques do not adequately meet the need for precise filtering of large-scale data. To address this, this paper combines text clustering with keyword extraction. A latent semantic model based on the LDA algorithm is used to cluster the text, which yields both the clustering results and the topic words extracted by LDA. The FP-growth algorithm is then used to analyze the LDA results and mine the text, producing a Chinese keyword set. Drawing on the idea of a network knowledge base, a Chinese comparison algorithm built on Baidu Baike (Baidu encyclopedia) is proposed to screen the Chinese keyword set, filtering out many noise words. Experimental results, compared against the existing method, show that the approach performs text clustering and keyword extraction well on a given Chinese corpus. In particular, after adding the screening step based on remote learning from Baidu Baike, the accuracy of the system improves substantially.
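The last two stages of the pipeline summarized in the abstract (frequent-pattern mining over topic words, then knowledge-base screening) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: the topic words are made-up stand-ins for LDA output, the mining step is a brute-force itemset count rather than FP-growth itself (it finds the same frequent itemsets without building an FP-tree), and `known_entries` is a hypothetical in-memory set standing in for Baidu Baike lookups.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) seen in at least min_support transactions.

    Brute-force stand-in for FP-growth: it enumerates every itemset up to
    max_size, so it is only suitable for small illustrative inputs.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def filter_by_knowledge_base(itemsets, known_entries):
    """Keep only itemsets whose every word appears in the knowledge base."""
    return {
        itemset: n
        for itemset, n in itemsets.items()
        if all(word in known_entries for word in itemset)
    }


# Hypothetical topic words per document, as an LDA step might produce them.
topic_words = [
    ["neural", "network", "training", "very"],
    ["neural", "network", "very"],
    ["network", "training", "data"],
    ["neural", "network", "data", "very"],
]

# Hypothetical stand-in for the set of entry titles in an online encyclopedia.
known_entries = {"neural", "network", "training", "data"}

candidates = frequent_itemsets(topic_words, min_support=3)
keywords = filter_by_knowledge_base(candidates, known_entries)
print(sorted(keywords.items()))
```

In this toy input the word "very" is frequent enough to survive the mining step but has no knowledge-base entry, so the screening step drops it along with every itemset that contains it. This is the kind of noise-word filtering the abstract attributes to the Baidu Baike stage.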
Authors
曹聪慧
兰强
侯群
漆为民
CAO Cong-hui; LAN Qiang; HOU Qun; QI Wei-min (School of Artificial Intelligence, Jianghan University, Wuhan 430056, Hubei; Dongfeng Motor Finance Co., Ltd., Wuhan 430056, Hubei)
Source
《电脑与电信》
2021, No. 8, pp. 1-5, 9 (6 pages in total)
Computer & Telecommunication
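Funding
Guiding Project of the Scientific Research Program of the Hubei Provincial Department of Education, project No. B2020224
2019 Open Project of the Hubei Provincial Key Discipline of Management Science and Engineering, Jianghan University, project No. ZDXK2019YB05
Jianghan University Research Start-up Fund for High-Level Talents, project No. 2019032.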
Keywords
text clustering
keyword extraction
LDA algorithm
remote learning
Chinese alignment algorithm