
A Duplicate Removal Algorithm of Cross-database Search Based on Sci-tech Novelty Retrieval (cited by: 2)
Abstract [Objective] To improve the efficiency of sci-tech novelty retrieval by removing duplicate records from cross-database search results. [Methods] A four-element feature tuple that uniquely identifies a record, {paper title, journal name, publication date, first author}, is extracted from the search records of different databases, and an improved version of the comparison algorithm used in I-Match builds a feature string for each record, which serves as the basis for duplicate detection. [Results] The algorithm performs a preliminary analysis and deduplication of the database search results, which improves retrieval efficiency. In testing, its precision was high, while its recall is affected by how complete the records held by each database are and leaves room for improvement. [Limitations] The results depend on extracting the feature tuple from the search records; because the results returned by different databases differ, a feature extraction template must be customized for each paper database. [Conclusions] The experiments indicate that the algorithm achieves high deduplication precision and processing efficiency and meets the intended requirements of sci-tech novelty retrieval.
Author: Hao Hui (郝慧)
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2015, No. 1, pp. 89-95 (7 pages)
Keywords: Cross-database search; Sci-tech novelty retrieval; Duplicate removal algorithm; I-Match
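
The record above describes fingerprinting each search record from four fields and removing records whose fingerprints collide. The following is a minimal Python sketch of that general idea, not the paper's algorithm: the field names (`title`, `journal`, `date`, `first_author`), the normalization rules, and the choice of SHA-1 are all assumptions made for illustration, and the paper's improved I-Match comparison step is not reproduced here.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold width and case, then drop punctuation, underscores and whitespace."""
    folded = unicodedata.normalize("NFKC", text).lower()
    # \w is Unicode-aware for str patterns, so CJK characters are kept.
    return re.sub(r"[\W_]+", "", folded)


def feature_string(record: dict) -> str:
    """Join the normalized four fields into one comparison string.

    The field names are hypothetical; real databases need their own
    extraction templates to produce them.
    """
    fields = ("title", "journal", "date", "first_author")
    return "|".join(normalize(record.get(name, "")) for name in fields)


def fingerprint(record: dict) -> str:
    """Hash the feature string (SHA-1 is an assumption, chosen for brevity)."""
    return hashlib.sha1(feature_string(record).encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Because the match is exact on the hashed string, a record with a missing or differently abbreviated field gets a different fingerprint and survives as a false negative. This is consistent with the abstract's remark that recall depends on how complete each database's records are.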

