A Cleaning Method for Consistency and Currency in Related Data (基于关联数据的一致性和时效性清洗方法)
Abstract: Data consistency and data currency are two central concerns of big data quality management. Conditional functional dependencies (CFDs) and currency constraints (CCs) are effective techniques for analyzing data consistency and data currency, respectively. Real-world data, however, often contains latent consistency and currency errors that CFDs and CCs can neither detect nor repair, and these errors degrade overall data quality. Such data is usually interrelated, and the relationships among data items can be used to uncover the latent errors. This paper uses condition-combined functional dependencies (CCFDs), which process related data together, and on this basis proposes a cleaning method for consistency and currency in related data. During cleaning, detection and repair influence each other: accurate detection gives repair a high-quality basis, and repair results feed back into detection. The paper therefore designs a new automatic cleaning framework that performs detection and repair iteratively. It then analyzes the fundamental problems of cleaning related data for consistency and currency, and proves that the minimum-cost repair problem for CCFDs and CCs is $\Sigma_2^p$-complete ($\mathrm{NP}^{\mathrm{NP}}$). Accordingly, it adopts a heuristic repair method that computes minimum-cost target values for the errors in each iteration. To improve repair accuracy, the paper also introduces the repairing sequence graph, which determines which errors should be repaired first. Experiments on two real-life datasets confirm that the method is practical and efficient.
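The iterative detect-and-repair idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it covers only the consistency side using plain CFDs (no CCFDs, currency constraints, or repairing sequence graph), and it substitutes a simple majority-vote heuristic for the minimum-cost target values. All function names, the `(lhs, rhs, pattern)` encoding of a CFD, and the sample rows are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def detect_cfd_violations(rows, lhs, rhs, pattern):
    """Return groups of row indices that match `pattern`, agree on the
    `lhs` attributes, but disagree on the `rhs` attribute."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if all(row[a] == v for a, v in pattern.items()):
            groups[tuple(row[a] for a in lhs)].append(i)
    return [idx for idx in groups.values()
            if len({rows[i][rhs] for i in idx}) > 1]

def repair_by_majority(rows, groups, rhs):
    """Heuristic repair (an assumption, not the paper's cost model):
    overwrite `rhs` with the most frequent value in each violating group."""
    changed = 0
    for idx in groups:
        target, _ = Counter(rows[i][rhs] for i in idx).most_common(1)[0]
        for i in idx:
            if rows[i][rhs] != target:
                rows[i][rhs] = target
                changed += 1
    return changed

def clean(rows, cfds, max_rounds=10):
    """Alternate detection and repair until a round changes nothing
    or `max_rounds` is reached. Returns the number of cells changed."""
    total = 0
    for _ in range(max_rounds):
        changed = 0
        for lhs, rhs, pattern in cfds:
            groups = detect_cfd_violations(rows, lhs, rhs, pattern)
            changed += repair_by_majority(rows, groups, rhs)
        if changed == 0:
            break
        total += changed
    return total

# Hypothetical data: within the UK, `zip` is assumed to determine `city`.
rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Boston"},
]
cfds = [(["zip"], "city", {"country": "UK"})]
repaired = clean(rows, cfds)
```

With a single CFD one round is enough; the loop matters when several constraints interact, because a repair made for one constraint can create or remove violations of another. The `max_rounds` bound is there because a naive heuristic like this one is not guaranteed to converge.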
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2017, No. 1, pp. 92-106 (15 pages)
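Funding: Supported by the National Basic Research Program of China (973 Program) (2012CB316200, 2012CB316201), the National Natural Science Foundation of China (61033007, 61472070, 61672142), and the Fundamental Research Funds for the Central Universities (N150408001-3, N150404013).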
Keywords: data consistency; data currency; big data quality; related data; data cleaning
