Abstract
Traditional text similarity algorithms use a term's frequency to represent its importance in a document. Because the frequency of a term fluctuates across documents of the same category, the resulting term weights are unstable, which makes similarity calculations between texts inaccurate. We propose an improved TF-IDF algorithm, based on the information content of terms, to compute term weights. The weights are applied in a vector space model and a Markov model, yielding a fundamental similarity based on the vector space model and a semantic similarity based on the Markov model; the two are then combined into an overall similarity between texts. We apply the improved similarity algorithm to text classification. Experimental results on the open Sogou text classification corpus show that, compared with the traditional text similarity algorithm, the improved algorithm considerably increases classification accuracy and F1.
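For orientation, the traditional baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of standard TF-IDF weighting with cosine similarity in a vector space model, not the improved weighting or the Markov model proposed in the paper, whose formulas the abstract does not give. The function names and the linear mixing weight `alpha` are assumptions introduced for illustration only; the abstract states that the two similarities are combined but not how.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Standard TF-IDF: docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(base_sim, semantic_sim, alpha=0.5):
    """Hypothetical linear mix of the two similarities; alpha is an assumed parameter."""
    return alpha * base_sim + (1 - alpha) * semantic_sim


docs = [["a", "b"], ["a", "c"], ["d"]]
vecs = tfidf_vectors(docs)
base = cosine(vecs[0], vecs[1])  # documents sharing the term "a"
```

Here `base` plays the role of the fundamental (vector space) similarity; a semantic similarity from any other model could be passed to `combined_similarity` alongside it.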
Source
Journal of Taishan University (《泰山学院学报》)
2015, No. 3, pp. 18-22 (5 pages)
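Funding
National Natural Science Foundation of China (61401060, 61272173)
Science and Technology Program of Higher Education Institutions of Shandong Province (J12LN73)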