
Automatic Web-based duplicate attribute resolution method (基于Web的重复属性自动识别方法)
Abstract: Building a data warehouse requires importing data from multiple sources. The imported data contain many approximately duplicated records, which seriously reduce data utilization and the quality of decision making. Detecting approximately duplicated records has therefore become an active research topic in data warehousing and related fields, and duplicate attribute resolution is the key step in that detection. This paper proposes an efficient Web-based algorithm that recognizes duplicate attributes automatically. The algorithm computes attribute similarity from the snippets and URL information returned by a search engine, and uses query probes to improve query accuracy. Experimental results show that the algorithm achieves high recall.
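Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2015, No. 9, pp. 125-128 (4 pages)
Funding: National Basic Research Program of China ("973" Program) (No. 2012CB316203); Northwestern Polytechnical University Graduate Seed Fund (No. Z2013125, No. Z2013126)
Keywords: duplicate attribute resolution; Web search; snippet; query probe; URL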
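
The abstract names the ingredients of the method (search-engine snippets, URL information, query probes) but not the formulas. The following Python sketch shows one plausible way such signals could be combined. It is not the authors' algorithm. The cosine measure over snippet terms, the Jaccard overlap of URL hosts, the weight `alpha`, and the form of the probe string are all assumptions made for illustration, and the search results are passed in as plain `(snippet, url)` pairs rather than fetched from any real search API.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokenize(text):
    """Lowercase a snippet and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def snippet_cosine(snippets_a, snippets_b):
    """Cosine similarity between the term-frequency vectors of two snippet sets."""
    tf_a = Counter(t for s in snippets_a for t in tokenize(s))
    tf_b = Counter(t for s in snippets_b for t in tokenize(s))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def host_jaccard(urls_a, urls_b):
    """Jaccard overlap of the host names that appear in two URL lists."""
    hosts_a = {urlparse(u).netloc.lower() for u in urls_a}
    hosts_b = {urlparse(u).netloc.lower() for u in urls_b}
    union = hosts_a | hosts_b
    return len(hosts_a & hosts_b) / len(union) if union else 0.0


def probe_query(value, probe):
    """Build a search string from an attribute value plus a probe term.

    Assumption: a probe is an extra keyword (e.g. a domain hint) appended
    to the quoted attribute value to narrow the search.
    """
    return f'"{value}" {probe}'


def attribute_similarity(results_a, results_b, alpha=0.5):
    """Weighted mix of snippet and URL similarity.

    results_a, results_b: lists of (snippet, url) pairs returned for the
    two attribute values. alpha is a hypothetical weighting parameter.
    """
    snip = snippet_cosine([s for s, _ in results_a], [s for s, _ in results_b])
    url = host_jaccard([u for _, u in results_a], [u for _, u in results_b])
    return alpha * snip + (1 - alpha) * url
```

In use, two attribute values would each be sent to a search engine (optionally through `probe_query`), and a pair whose `attribute_similarity` exceeds a chosen threshold would be treated as duplicate candidates. The threshold, like `alpha`, would have to be tuned on labeled data.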
