Abstract
The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm does not consider the importance of a feature term itself within a document when computing term weights. To address this problem, users' reading behaviors were used to improve TF-IDF. Document Triage was introduced into TF-IDF: the Interest Profile Manager (IPM) was used to collect data on users' reading behaviors, from which document scores were computed. Because the content a user annotates is often important content of the text or reflects the user's interests, annotated terms were given greater weight. An improved term weighting algorithm, Document Triage-Term Frequency-Inverse Document Frequency (DT-TF-IDF), was designed by introducing document scores and users' annotation information as factors into TF-IDF. Experimental results show that the recall, the precision, and their harmonic mean for DT-TF-IDF are all higher than those of the traditional TF-IDF algorithm. DT-TF-IDF is more effective than traditional TF-IDF and improves the accuracy of text similarity calculation.
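The abstract does not state the DT-TF-IDF weighting formula, so the following is only a minimal sketch of the idea it describes, not the authors' method. It assumes, purely for illustration, that the document score and an annotation boost enter as multiplicative factors on the classic TF-IDF weight. The function names, the `boost` parameter, and the multiplicative form are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF. `docs` is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def dt_tf_idf(docs, doc_scores, annotated, boost=2.0):
    """Hypothetical DT-TF-IDF sketch (assumed form, not taken from the paper).

    doc_scores: one score per document, e.g. derived from reading behavior.
    annotated:  one set of user-annotated terms per document.
    boost:      assumed multiplier (> 1) applied to annotated terms.
    """
    result = []
    for base, score, marked in zip(tf_idf(docs), doc_scores, annotated):
        result.append(
            {t: w * score * (boost if t in marked else 1.0) for t, w in base.items()}
        )
    return result


docs = [["a", "b", "b"], ["a", "c"]]
weights = dt_tf_idf(docs, doc_scores=[1.0, 0.5], annotated=[{"b"}, set()])
```

In this sketch the term "b", annotated in the first document, receives twice its plain TF-IDF weight, while every weight in the second document is halved by its lower document score. How the paper actually combines these factors would need to be taken from the full text.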
Source
《计算机应用》
CSCD
PKU Core Journal (北大核心)
2015, No. 12, pp. 3506-3510, 3514 (6 pages in total)
Journal of Computer Applications