Topology-based document similarity search algorithm (一种基于文档拓扑的相似性搜索算法)

Cited by: 1

Abstract: Quickly and effectively finding similar documents in a massive document collection is an important and time-consuming problem. Existing document similarity search algorithms first identify a set of candidate documents and then rank the candidates by relevance to find the most relevant ones. This paper proposes Hub-N, a similarity search algorithm based on document topology. It transforms the document similarity search problem into a graph search problem and applies corresponding pruning techniques, which narrows the range of documents that must be scanned and improves search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.
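The abstract does not describe how Hub-N works internally, so the following is only a generic sketch of the idea it names: treating similarity search as a graph search with pruning. Everything here is an assumption made for illustration, including the function name `pruned_graph_search`, the bag-of-words cosine score, the seed documents, the `min_sim` threshold, and the toy link graph. It is not the authors' algorithm.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pruned_graph_search(query, docs, edges, seeds, k=2, min_sim=0.1):
    """Best-first traversal of a document graph (hypothetical sketch).

    Documents are scored only when reached through a link, and a document
    whose score is below min_sim is not expanded (the pruning step).
    Returns (top-k list of (doc_id, score), number of documents scored).
    """
    q = Counter(query.split())
    scores = {}
    frontier = []  # max-heap emulated with negated scores

    def visit(doc_id):
        scores[doc_id] = cosine(q, Counter(docs[doc_id].split()))
        heapq.heappush(frontier, (-scores[doc_id], doc_id))

    for seed in seeds:
        visit(seed)
    while frontier:
        neg_score, doc_id = heapq.heappop(frontier)
        if -neg_score < min_sim:
            continue  # prune: neighbours of a dissimilar document are skipped
        for neighbour in edges.get(doc_id, ()):
            if neighbour not in scores:
                visit(neighbour)
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return top, len(scores)


# Toy corpus and link graph, invented for this example.
docs = {
    "a": "graph search pruning",
    "b": "graph search index",
    "c": "cooking pasta recipe",
    "d": "pasta sauce recipe",
    "e": "graph pruning search algorithm",
}
edges = {"a": ["b", "c"], "b": ["a", "e"], "c": ["a", "d"], "d": ["c"], "e": ["b"]}

top, scanned = pruned_graph_search("graph search pruning", docs, edges, seeds=["a"])
print([doc_id for doc_id, _ in top], scanned)
```

In this toy run, document "d" is never scored because its only neighbour "c" falls below the threshold, so four of the five documents are scanned. A real system would need a principled way to choose seeds and thresholds, which the abstract does not specify.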

Authors: Yang Yan (杨艳), Zhu Ge (朱戈), Fan Wenbin (范文彬)

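Affiliations: School of Computer Science and Technology, Heilongjiang University; Key Laboratory of Computational Biology, Heilongjiang University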

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2011, Issue 26, pp. 146-150 (5 pages)

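Funding: National Natural Science Foundation of China (No. 60973081); General Science and Technology Research Project of the Heilongjiang Provincial Department of Education (No. 11541263, No. 11551352)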

Keywords: document topology; similarity search; similarity

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]





