WR Approach: Determining Accurate Attribute Values in Big Data Integration
Abstract Big data integration is the foundation for supplying high-quality data to decision making. A key step in integration is determining the accurate attribute values of an entity from the different tuples that describe it in a database. The state-of-the-art R-topK approach applies manually designed rules to the data to decide the relative accuracy of attribute values, and thereby obtains relatively accurate values. However, when several values may be accurate, or when the designed rules conflict, R-topK requires many rounds of human interaction. This paper proposes the weighted-rule (WR) approach for determining accurate attribute values in big data integration. WR augments each rule for judging relative accuracy with a weight, which avoids the human intervention that R-topK needs when candidate accurate values conflict. A constraint inference algorithm based on the chase procedure is designed and proved to derive, in O(n/+2), the weighted relative accuracy between each pair of attribute values; these weighted relations form the constraints from which accurate attribute values are derived. To handle possible conflicts among the constraints, a target solving algorithm is proposed that searches all combinations of attribute values for the most likely accurate values in O(n) time. Extensive experiments on real-world and synthetic datasets verify the effectiveness and efficiency of WR: compared with R-topK, WR improves performance by a factor of 3 to 15 and effectiveness by 7% to 80%.
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2016, No. 2, pp. 449-458 (10 pages)
Funding National Basic Research Program of China ("973" Program) (2014CB340403); State Grid Corporation of China research project (EPRIPDKJ[2014]3763)
Keywords big data integration; data quality; data accuracy; data cleaning; weighted rules
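
The idea of weighted rules described in the abstract can be illustrated with a small Python sketch. This is only a toy under stated assumptions, not the paper's chase-based inference or its target solving algorithm: the records, the two rules, their weights, and the helper names `infer_constraints` and `pick_most_accurate` are all hypothetical. It shows how weights let conflicting pairwise accuracy constraints be resolved without asking a human.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical tuples describing the same entity (illustrative data only).
records = [
    {"city": "Beijing", "year": 2012, "source": "web"},
    {"city": "Shanghai", "year": 2015, "source": "web"},
    {"city": "Wuhan", "year": 2013, "source": "registry"},
]

# Hypothetical weighted rules: (predicate, weight).
# predicate(a, b) is True when the rule says b is more accurate than a.
rules = [
    (lambda a, b: b["year"] > a["year"], 0.6),  # newer record wins
    (lambda a, b: b["source"] == "registry" and a["source"] != "registry", 0.9),
]


def infer_constraints(records, rules):
    """Apply every rule to every ordered pair of records.

    Returns {(i, j): weight}, read as "record j is more accurate than
    record i, with the summed weight of all rules that say so".
    """
    constraints = defaultdict(float)
    for i, j in permutations(range(len(records)), 2):
        for predicate, weight in rules:
            if predicate(records[i], records[j]):
                constraints[(i, j)] += weight
    return dict(constraints)


def pick_most_accurate(records, constraints):
    """Resolve conflicts by net weight: support for a record minus support against it."""
    score = [0.0] * len(records)
    for (worse, better), weight in constraints.items():
        score[better] += weight
        score[worse] -= weight
    return max(range(len(records)), key=lambda k: score[k])


constraints = infer_constraints(records, rules)
best = pick_most_accurate(records, constraints)
print(records[best]["city"])
```

In this toy data the recency rule prefers the Shanghai record over the Wuhan record, while the source rule prefers the reverse, so the two constraints conflict; the heavier source rule decides the outcome automatically. The net-weight scoring is a deliberately simple stand-in for the paper's search over attribute value combinations.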