
Method for Approximately Duplicate Metadata Record Detection (一种相似重复元数据记录检测方法)

Cited by: 3
Abstract: Detecting and managing duplicate metadata records in a federated digital library is key to ensuring metadata quality and improving the quality of federated search services. Existing duplicate record detection methods for federated digital libraries are computationally intensive and not accurate enough. To address these shortcomings, this paper proposes a fast and efficient method for detecting approximately duplicate metadata records. The method is based on an improved N-Gram method and is suited to relatively large federated digital libraries. Simulation results show that the method effectively improves duplicate detection performance and speeds up detection.

Source: Computer Engineering (《计算机工程》), 2009, No. 21, pp. 85-87 (3 pages). Indexed in: CAS, CSCD, Peking University Core (北大核心).

Funding: Natural Science Foundation of Hebei Province (F2008000877)

Keywords: metadata; duplicate record detection; N-Gram method; similarity
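
The abstract names an improved N-Gram method but does not describe the improvement, so the following is only a minimal sketch of the plain baseline idea: compare two metadata records by the overlap of their character n-grams. The field names (`title`, `creator`), the Jaccard measure, the trigram size, and the 0.8 threshold are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a normalized string."""
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not affect the comparison.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def record_key(record: dict[str, str]) -> str:
    """Concatenate the fields used for comparison (hypothetical field names)."""
    return f"{record.get('title', '')} {record.get('creator', '')}"


def find_duplicates(records: list[dict[str, str]],
                    threshold: float = 0.8,
                    n: int = 3) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every record pair scoring at or above threshold.

    This compares all pairs, which is quadratic in the number of records;
    it is meant to show the similarity test, not a scalable pipeline.
    """
    pairs = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        score = ngram_similarity(record_key(r1), record_key(r2), n)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

Given a list of record dictionaries, `find_duplicates(records)` returns the index pairs whose combined title and creator strings are near-identical. Because every pair is compared, a large collection would need some way to limit candidate pairs first; how the paper handles that is not stated in the abstract.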


