
Improvement of the term frequency-inverse document frequency (TF-IDF) algorithm based on Document Triage (cited by: 14)
Abstract: The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm does not consider how important a feature term is within the document itself when computing term weights. To address this problem, users' reading behavior is used to improve TF-IDF. Document Triage is introduced into TF-IDF: the Interest Profile Manager (IPM) collects data about users' behavior while reading, and a score is computed for each document from these data. Because the content a user annotates is often an important part of the text, or reflects the user's interests, annotated terms are given a greater weight. The document score and the user's annotation information are introduced into TF-IDF as factors, yielding an improved term weighting algorithm named Document Triage-Term Frequency-Inverse Document Frequency (DT-TF-IDF). The experimental results show that, compared with the traditional TF-IDF algorithm, the recall, the precision, and the harmonic mean of precision and recall of DT-TF-IDF all improve to some extent. The proposed DT-TF-IDF algorithm is more effective than traditional TF-IDF and improves the accuracy of text similarity calculation.
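The abstract does not give the DT-TF-IDF formula, so the following is only a minimal sketch of the general idea: classic TF-IDF weights scaled by a per-document score and by a boost for annotated terms. The multiplicative combination, the function names, the `boost` parameter, and the sample inputs are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: (term count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def dt_tf_idf_sketch(docs, doc_scores, annotated_terms, boost=2.0):
    """Hypothetical variant: scale each TF-IDF weight by the document's score
    and by `boost` when the user annotated the term.
    ASSUMPTION: the paper's actual combination of factors may differ."""
    result = []
    for weights, score, annotated in zip(tf_idf(docs), doc_scores, annotated_terms):
        result.append(
            {
                t: w * score * (boost if t in annotated else 1.0)
                for t, w in weights.items()
            }
        )
    return result


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, `doc_scores` stands in for the document scores derived from reading behavior and `annotated_terms` for the sets of terms each user marked; both would have to come from a tool such as IPM, which is not modeled here.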
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2015, No. 12, pp. 3506-3510, 3514 (6 pages)
Keywords: Term Frequency-Inverse Document Frequency (TF-IDF); Document Triage; annotation; weighting

