Abstract
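Data consistency and data currency are two central concerns of big data quality management. Conditional functional dependencies (CFDs) and currency constraints (CCs) are effective techniques for analyzing data consistency and data currency, respectively. Real-world data contains latent consistency and currency errors that CFDs and CCs can neither detect nor repair, and these errors degrade overall data quality. Such data is usually interrelated, and the relationships among records can be used to uncover the latent errors. This paper uses condition-combined functional dependencies (CCFDs) to process related data together and, on that basis, proposes a method for cleaning the consistency and currency of related data. Because detection and repair influence each other during cleaning, the paper designs a new automatic cleaning framework that performs detection and repair iteratively. It then analyzes the problems underlying consistency and currency cleaning of related data and proves that the minimum-cost repair problem for CCFDs and CCs is Σ_2^p-complete (NP^NP). The paper therefore adopts a heuristic method to repair errors and, to improve repair accuracy, introduces the concept of a repairing sequence graph. Finally, experiments on two real-world datasets verify the practicality and efficiency of the method.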
Data consistency and data currency are critical issues in big data quality management. Conditional functional dependencies (CFDs) and currency constraints (CCs) are two techniques for analyzing data consistency and data currency. However, real-world data is often mixed with latent inconsistency and currency errors that existing methods cannot detect, let alone repair, which results in low-quality data. The content expressed by such real-life data is interrelated, and this association helps to discover the latent errors. To exploit it, we employ condition-combined functional dependencies (CCFDs), which put related data together during error detection. In this paper, we propose a cleaning method for consistency and currency in related data. In practice, detection and repair interact during data cleaning: accurate detection provides a high-quality basis for repairs, and the results of the repairs feed back into detection. Hence, we design an automatic cleaning framework that detects and repairs data errors iteratively. Furthermore, we discuss the fundamental problems of data cleaning that combines consistency and currency. We prove that the minimum-cost repair problem for CCFDs and CCs is Σ_2^p-complete (NP^NP), and we therefore propose a heuristic repairing method that computes the minimum-cost target values for repairing the errors in each iteration. In addition, to improve the precision of data repairing, we present the Repairing Sequence Graph, which determines the errors that should be repaired first. Our empirical evaluation on two real-life datasets shows that the solution is both effective and efficient.
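To make the detect-then-repair idea concrete, the following is a minimal Python sketch under stated assumptions. It is not the paper's algorithm: it handles a single plain CFD rather than CCFDs or currency constraints, and it uses a majority vote as a simple stand-in for the minimum-cost heuristic. The names `find_violations`, `repair`, `WILDCARD`, and the attribute names in the usage note are hypothetical.

```python
from collections import Counter, defaultdict

WILDCARD = "_"  # hypothetical marker: a pattern entry that matches any value


def matches(row, pattern, attrs):
    """True if row agrees with the pattern on every attribute in attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def find_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if matches(row, pattern, lhs):
            groups[tuple(row[a] for a in lhs)].append(i)
    bad = set()
    for members in groups.values():
        # Constant part: the rhs value must equal the pattern constant, if any.
        bad.update(i for i in members if not matches(rows[i], pattern, [rhs]))
        # Variable part: rows that agree on lhs must also agree on rhs.
        if len({rows[i][rhs] for i in members}) > 1:
            bad.update(members)
    return sorted(bad)


def repair(rows, lhs, rhs, pattern, max_rounds=10):
    """Alternate detection and repair until no violation remains."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    for _ in range(max_rounds):
        bad = find_violations(rows, lhs, rhs, pattern)
        if not bad:
            break
        groups = defaultdict(list)
        for i in bad:
            groups[tuple(rows[i][a] for a in lhs)].append(i)
        for members in groups.values():
            if pattern[rhs] != WILDCARD:
                target = pattern[rhs]  # the pattern fixes the correct value
            else:
                # Assumption: the most frequent value is the cheapest repair.
                counts = Counter(rows[i][rhs] for i in members)
                target = counts.most_common(1)[0][0]
            for i in members:
                rows[i][rhs] = target
    return rows
```

For example, with the CFD ([country, zip] -> city) restricted to country = "UK", three UK rows that share a zip code but disagree on the city are all flagged by `find_violations`, and `repair` rewrites the minority value to the majority one while leaving non-UK rows alone. With one CFD a single round suffices; the loop only becomes relevant when a repair can expose new violations, which is the situation the iterative framework in the abstract addresses.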
Source
《计算机学报》
EI
CSCD
PKU Core (Peking University Core Journals)
2017, No. 1, pp. 92-106 (15 pages)
Chinese Journal of Computers
Funding
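Supported by the National Basic Research Program of China (973 Program) (2012CB316200, 2012CB316201)
National Natural Science Foundation of China (61033007, 61472070, 61672142)
Fundamental Research Funds for the Central Universities (N150408001-3, N150404013)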