
Research on a Text Similarity Algorithm Based on an Improved TF-IDF Method (cited by: 10)
Abstract: Traditional text similarity algorithms use a keyword's frequency to represent its importance in a document. Because the frequency of a keyword fluctuates across documents of the same category, the keyword's weight is unstable, which makes the similarity computed between texts inaccurate. This paper proposes an improved TF-IDF algorithm, based on the information content of terms, to compute keyword weights. The resulting weights are applied in a vector space model and in a Markov model, yielding a fundamental similarity based on the vector space model and a semantic similarity based on the Markov model. The two are then combined to give the overall similarity between texts. The improved similarity algorithm is applied to text classification. Experimental results on the Sogou text classification corpus show that, compared with the traditional text similarity algorithm, the improved algorithm considerably raises classification accuracy and F1.
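The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not give the information-content-based weighting, the Markov-model similarity, or the combination rule. The sketch therefore assumes standard TF-IDF weights, treats the semantic similarity as a value supplied from elsewhere, and combines the two scores with a hypothetical linear weight `alpha`. All function names are illustrative.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Standard TF-IDF is assumed here;
    the paper's improved, information-content-based weighting is not
    specified in the abstract and is NOT reproduced.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors (vector space model)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overall_similarity(base_sim, semantic_sim, alpha=0.5):
    """Hypothetical linear combination of the two similarity scores.

    alpha is an assumed mixing weight, not a value from the paper.
    """
    return alpha * base_sim + (1.0 - alpha) * semantic_sim


docs = [["a", "b"], ["a", "c"], ["d"]]
vecs = tf_idf_vectors(docs)
base = cosine(vecs[0], vecs[1])
# semantic_sim would come from a Markov-model component; 0.8 is a placeholder.
print(overall_similarity(base, semantic_sim=0.8, alpha=0.5))
```

In this sketch, documents that share no terms get a fundamental similarity of 0, so any overall similarity between them comes entirely from the semantic component.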
Source: Journal of Taishan University (《泰山学院学报》), 2015, No. 3, pp. 18-22 (5 pages)
Funding: National Natural Science Foundation of China (61401060, 61272173); Shandong Province Higher Education Science and Technology Program (J12LN73)
Keywords: text similarity algorithm; TF-IDF method; word relation; Markov model; text categorization
