Research on document similarity computing based on multi-grams of context (基于上下文多元信息的文档相似度计算研究)
Cited by: 2
Abstract: This paper presents a method for computing document similarity based on multi-grams of context. The method first extracts feature words from each document. For documents that share feature words with the same (or similar) meaning, it collects several kinds of information about the words that co-occur with each feature word in context: part of speech, semantic information, positional relation, and average co-occurrence probability. This information is quantified and expressed as a similarity function. A similarity evaluation value is then computed for each pair of documents from their similarity functions and serves as an important basis for judging how similar the two documents are. The method was evaluated on the Chinese documents of the NTCIR-3 cross-language information retrieval collection by re-ranking the initially retrieved documents. The results show that average precision over the top 10 and top 100 retrieved documents increased by 15.45%-18.49% and 11.96%-15.35%, respectively. In a second group of experiments on Web pages carrying the same information, the average F1-measure of document similarity was above 95% for each of several document categories.
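Source: Journal of Harbin Engineering University (《哈尔滨工程大学学报》; indexed in EI, CAS, CSCD, Peking University Core Journals), 2006, issue B07, pp. 397-402 (6 pages)
Funding: National Natural Science Foundation of China (60302021); Natural Science Foundation of Heilongjiang Province (F2004-04).
Keywords: similarity computing; context; multi-grams; similarity function; knowledge acquisition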
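
The abstract describes comparing documents through the words that co-occur with their shared feature words. The following is only a minimal sketch of that general idea, not the paper's similarity function: it treats every shared token as a feature word, uses raw co-occurrence counts inside a fixed window, and compares contexts with cosine similarity. The part-of-speech, semantic, and positional components named in the abstract are omitted, and the function names and the `window` parameter are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_profile(tokens, feature, window=2):
    """Count the words found within `window` positions of each occurrence of `feature`."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok == feature:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profile[tokens[j]] += 1
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def document_similarity(doc_a, doc_b, window=2):
    """Average context similarity over the tokens shared by two tokenized documents.

    Assumption: every shared token counts as a feature word; the paper
    selects feature words by a procedure not given in the abstract.
    """
    shared = set(doc_a) & set(doc_b)
    if not shared:
        return 0.0
    scores = [
        cosine(context_profile(doc_a, w, window), context_profile(doc_b, w, window))
        for w in shared
    ]
    return sum(scores) / len(scores)
```

Documents are passed as lists of tokens, so Chinese text would need word segmentation first. Sorting an initial retrieval list by `document_similarity` against a reference document gives a simple form of the re-ranking step mentioned in the abstract.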
