
Detecting Snapshots for Web Tables (网络表格间的快照关系发现) Cited by: 1
Abstract: In recent years, large amounts of structured tabular data have emerged on the Internet. The value of Web tables lies not only in the data itself but also in the relationships among the data, and these structured data can be fully exploited only once the latent relationships between tables have been detected. This paper proposes a new type of relatedness between Web tables, the snapshot relationship, together with a framework for discovering it and an algorithm for detecting snapshot tables that satisfy a given matching condition with respect to a given table. Snapshot tables can be used to optimize queries and to return partial query results in real time in big-data settings. The relatedness between an original Web table and its snapshot is scored by the overlap of their entities and attributes (entity consistency and schema consistency). The concept of entity freshness is introduced so that the discovery process gives more weight to tables that contribute fresh entities. In addition, a table content enhancement algorithm based on a Bayesian model judges the consistency of values in attribute columns more accurately, which improves the accuracy of snapshot discovery. Extensive experiments show that the scoring model finds high-quality snapshot tables and performs well in query precision and recall.
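Authors: Wang Ning (王宁), Ren Hongwei (任红伟)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 7, pp. 5-11 (7 pages)
Funding: National Natural Science Foundation of China (61370060); Natural Science Foundation of Jiangsu Province (BK2011454)
Keywords: Web tables; relatedness; snapshot; data integration; query optimization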
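
The abstract describes a score built from entity and attribute overlap, with extra weight for tables that supply fresh entities. The sketch below illustrates that idea only. The abstract gives no formulas, so the `WebTable` type, the containment measure, the weights `alpha` and `beta`, and the way freshness is combined with overlap are all assumptions made for illustration, not the authors' scoring model. The Bayesian content enhancement step is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebTable:
    """Toy Web table (hypothetical): entity keys plus attribute names."""
    entities: frozenset
    attributes: frozenset


def containment(candidate: frozenset, reference: frozenset) -> float:
    """Fraction of the candidate's items that also occur in the reference."""
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


def freshness(candidate: WebTable, seen: frozenset) -> float:
    """Fraction of the candidate's entities not already supplied by
    previously selected snapshots (assumed definition)."""
    if not candidate.entities:
        return 0.0
    return len(candidate.entities - seen) / len(candidate.entities)


def snapshot_score(candidate: WebTable, original: WebTable,
                   seen: frozenset = frozenset(),
                   alpha: float = 0.5, beta: float = 0.2) -> float:
    """Assumed combination: a weighted mix of entity and schema overlap,
    scaled down by up to `beta` when the candidate offers no fresh entities."""
    entity_part = containment(candidate.entities, original.entities)
    schema_part = containment(candidate.attributes, original.attributes)
    base = alpha * entity_part + (1 - alpha) * schema_part
    return base * (1 - beta + beta * freshness(candidate, seen))


if __name__ == "__main__":
    original = WebTable(frozenset({"a", "b", "c", "d"}),
                        frozenset({"name", "population"}))
    candidate = WebTable(frozenset({"a", "b"}), frozenset({"name"}))
    # Same candidate, scored before and after its entities have been seen.
    print(snapshot_score(candidate, original))
    print(snapshot_score(candidate, original, seen=frozenset({"a", "b"})))
```

Under these assumptions, a candidate whose rows and columns are fully contained in the original table receives the maximum base score, and the score drops once its entities have already been supplied by other snapshots.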












