A Duplicate Removal Algorithm of Cross-database Search Based on Sci-tech Novelty Retrieval

Cited by: 2

Abstract: [Objective] Deduplicate cross-database search results in sci-tech novelty retrieval in order to improve retrieval efficiency. [Methods] Select the feature quadruple {paper title, journal name, publication date, first author}, which uniquely identifies a search record across different databases, and use a modified version of the comparison algorithm from I-Match to build a feature string for each search record as the basis for deduplication. [Results] The cross-database deduplication algorithm performs a preliminary analysis and deduplication of database search results, improving novelty retrieval efficiency. Tests show that the algorithm's deduplication precision is high, while its recall is affected by how complete the information indexed by each database is and still has room for improvement. [Limitations] The algorithm's effectiveness depends on extracting the feature quadruple from database search records. Because the results returned by different databases differ, a record feature-extraction template must be customized for each paper database. [Conclusions] Experimental tests show that the algorithm achieves high deduplication precision and processing efficiency, meeting the intended requirements of sci-tech novelty retrieval.
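The general idea in the abstract (reduce each search record to a feature quadruple, build a feature string from it, and treat records with the same string as duplicates) can be sketched roughly as follows. This is a minimal illustration, not the author's modified I-Match comparison algorithm, which the abstract does not spell out. The field names, the normalization rules, the `|` separator, and the use of SHA-1 as the fingerprint are all assumptions made for the example.

```python
import hashlib
import unicodedata
from typing import Dict, Iterable, List

# Hypothetical field names for the feature quadruple
# {paper title, journal name, publication date, first author}.
FIELDS = ("title", "journal", "date", "first_author")


def normalize(value: str) -> str:
    """Fold case and character width, then keep only letters and digits."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def fingerprint(record: Dict[str, str]) -> str:
    """Build the feature string from the quadruple and hash it (assumed SHA-1)."""
    feature_string = "|".join(normalize(record.get(f, "")) for f in FIELDS)
    return hashlib.sha1(feature_string.encode("utf-8")).hexdigest()


def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Under these assumptions, two records that differ only in case, punctuation, or spacing collapse to one fingerprint, whereas a record with a missing or differently recorded field produces a different fingerprint and is not detected as a duplicate. That behavior is in line with the abstract's remark that recall depends on how complete the database records are.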

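Author: 郝慧 (Hao Hui)

Affiliation: Beijing University of Technology Library

Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, 2015, No. 1, pp. 89-95 (7 pages)

Keywords: Cross-database search; Sci-tech novelty retrieval; Duplicate removal algorithm; I-Match

Classification: G252.62 [Cultural Science: Library Science]; G252.7 [Cultural Science: Library Science]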

References (6)

1. 李雪婷, 李莘, 王晓丹. Research and Implementation of a JAVA-based Intelligent Deduplication System for Chinese Novelty Retrieval in Libraries [J]. 图书馆学研究, 2013(17): 56-58.
2. 洪道广. Research on Data Integration in Google Scholar [J]. 现代情报, 2010, 30(7): 39-41.
3. Broder A Z, Glassman S C, Manasse S, et al. Syntactic Clustering of the Web [C]. In: Proceedings of the 6th International World Wide Web Conference. Essex, UK: Elsevier Science Publishers, 1997: 1157-1166.
4. Broder A Z. Identifying and Filtering Near-duplicate Documents [C]. In: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'00). London, UK: Springer-Verlag, 2000: 1-10.
5. Chowdhury A, Frieder O, Grossman D, et al. Collection Statistics for Fast Duplicate Document Detection [J]. ACM Transactions on Information Systems, 2002, 20(2): 171-191.
6. Charikar M S. Similarity Estimation Techniques from Rounding Algorithms [C]. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC'02). New York, USA: ACM, 2002: 380-388.
