Topology-based document similarity search algorithm
Abstract: Quickly and effectively finding similar documents in a massive document collection is an important and time-consuming problem. Existing document similarity search algorithms first identify a candidate document set and then rank the candidates by relevance to find the most relevant documents. This paper proposes Hub-N, a similarity search algorithm based on document topology. It transforms the document similarity search problem into a graph search problem and applies corresponding pruning techniques, which narrows the range of documents scanned and improves search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.
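Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2011, No. 26, pp. 146-150 (5 pages)
Funding: National Natural Science Foundation of China (No. 60973081); General Science and Technology Research Project of the Heilongjiang Provincial Department of Education (No. 11541263, No. 11551352)
Keywords: document topology; similarity search; similarity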
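
The abstract's central idea, recasting similarity search as a graph search with pruning, can be illustrated with a small sketch. This is not the Hub-N algorithm itself, whose details are not given in this record. Everything below is an assumption made for illustration: the bag-of-words cosine similarity, the thresholded similarity graph, the choice of highest-degree nodes as entry points, and all function and parameter names (`build_graph`, `search`, `edge_threshold`, `n_hubs`, `prune_below`).

```python
import heapq
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_graph(vectors, edge_threshold):
    """Assumed topology: link documents whose similarity reaches edge_threshold."""
    graph = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= edge_threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def search(query, vectors, graph, n_hubs=2, prune_below=0.1, top_k=3):
    """Best-first graph search starting from the highest-degree documents.

    A document's neighbors are expanded only if the document itself scores at
    least prune_below against the query (the pruning step of this sketch).
    Returns (ranked results, number of documents actually scored).
    """
    q = vectorize(query)
    hubs = sorted(graph, key=lambda d: len(graph[d]), reverse=True)[:n_hubs]
    scores = {}
    frontier = []
    for hub in hubs:
        scores[hub] = cosine(q, vectors[hub])
        heapq.heappush(frontier, (-scores[hub], hub))
    while frontier:
        neg_score, doc = heapq.heappop(frontier)
        if -neg_score < prune_below:
            continue  # pruned: do not follow edges out of a weak match
        for neighbor in graph[doc]:
            if neighbor not in scores:
                scores[neighbor] = cosine(q, vectors[neighbor])
                heapq.heappush(frontier, (-scores[neighbor], neighbor))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return ranked, len(scores)


# Hypothetical toy corpus: two topical clusters joined by one bridging document.
docs = [
    "graph search pruning algorithm",
    "graph search algorithm efficiency",
    "document similarity search algorithm",
    "cooking pasta recipe tomato",
    "tomato sauce recipe",
    "pasta sauce cooking",
    "algorithm recipe",
]
vectors = [vectorize(d) for d in docs]
graph = build_graph(vectors, edge_threshold=0.3)
ranked, scanned = search("graph search algorithm", vectors, graph)
print(ranked, scanned)
```

In this toy run the search scores fewer documents than the corpus holds, because edges leaving weakly matching documents are never followed. A sketch like this trades recall for speed: a relevant document reachable only through pruned nodes would be missed, so the thresholds would need tuning against real data.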