
Research on text similarity combining micro-variation of keywords and the LD algorithm
Abstract: Text similarity algorithms based on the traditional vector space model ignore both the high dimensionality of text vectors and micro-variations of keywords, which makes their similarity results imprecise. To address this problem, the paper proposes TSABCLDA (Text Similarity Algorithm Based on Clustering and LD Algorithm), a text similarity algorithm for the case where keywords vary slightly. The texts are first preprocessed by removing digits, punctuation, and stop words. Low-frequency words are then reduced by clustering, the LD algorithm is used to compute the similarity between feature words, and a text similarity matrix is built. Finally, the similarity between texts is computed from space vectors constructed from the feature-word similarities and their weights. This approach accounts for micro-variation of keywords and also effectively addresses the high dimensionality of text vectors; applied to text mining, it can improve the efficiency of mining similar texts. Experimental results show that, because keyword micro-variation is taken into account, the accuracy of the algorithm's text similarity improves markedly within a certain threshold range.
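The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes that "LD" refers to the Levenshtein (edit) distance, that word similarity is the edit distance normalized by the longer word's length, and that term weights are raw counts; none of these details are stated in the abstract. The function names, the `threshold` parameter, and the greedy merging of near-identical words are hypothetical simplifications, and the clustering-based reduction of low-frequency words is omitted, so this is not the authors' TSABCLDA implementation.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def text_similarity(words_a, words_b, threshold=0.8):
    """Cosine similarity after merging words that differ only slightly.

    Hypothetical simplification: each word is mapped to the first
    previously seen word whose similarity reaches `threshold`, so slightly
    varied keywords share one vector dimension. Weights are raw counts.
    """
    representatives = []

    def canonical(word):
        for rep in representatives:
            if word_similarity(word, rep) >= threshold:
                return rep
        representatives.append(word)
        return word

    vec_a = Counter(canonical(w) for w in words_a)
    vec_b = Counter(canonical(w) for w in words_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With `threshold=1.0` the sketch reduces to an ordinary count-based cosine similarity, in which "color" and "colour" occupy separate dimensions; lowering the threshold lets such variants count as the same feature word, which is the effect the abstract attributes to considering keyword micro-variation.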
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2016, No. 8, pp. 70-73, 124 (5 pages in total)
Funding: Anhui Provincial University Natural Science Research Project (No. KJ2013A177); Anhui Provincial Natural Science Foundation (No. 10040606Q42)
Keywords: clustering; LD algorithm; text similarity matrix; vector space model; text similarity