Abstract
In data quality research, functional dependencies (FDs) are widely used to repair inconsistencies in relational data. A major challenge, however, is how to automatically discover valid FDs from relational data that contains errors. Existing automatic FD discovery methods, which rely on a statistical confidence measure, often return many FDs that hold approximately but are in fact invalid, and applying these FDs directly to repair data can introduce further errors. To address this problem, we propose an FD detection method based on data semantics analysis. The method uses conditional probabilities to measure the confidence of attribute values and tuples, and then aggregates them to estimate the confidence that a given FD holds. We also present an efficient method for constructing a Markov blanket Bayesian network for each attribute of the relational data, and we use these networks to compute the conditional probabilities. Experiments on both synthetic and real-world data show that, in automatic detection, the semantics-based confidence measure achieves considerably higher accuracy than the statistics-based measure. In an interactive detection scenario, in which each iteration asks the user to verify the FDs with the highest confidence, the semantics-based measure also requires less user effort than the statistics-based approach.
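The contrast the abstract draws can be made concrete with a small sketch. It assumes one common statistical confidence for an FD X -> Y, namely the largest fraction of tuples that can be kept so that the FD holds exactly, and it uses a plain empirical conditional probability P(Y = y | X = x) as a naive per-value confidence. The function names and the toy table are hypothetical, and the empirical estimate only stands in for the Markov blanket Bayesian networks of the paper; it is not the authors' method.

```python
from collections import Counter, defaultdict


def _group_counts(rows, lhs, rhs):
    """Count the rhs values observed for each combination of lhs values."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1
    return groups


def statistical_fd_confidence(rows, lhs, rhs):
    """Largest fraction of rows that can be kept so that lhs -> rhs holds exactly."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = _group_counts(rows, lhs, rhs)
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


def conditional_value_confidence(rows, lhs, rhs):
    """Empirical P(rhs = y | lhs = x) for each row (assumed stand-in, see lead-in)."""
    groups = _group_counts(rows, lhs, rhs)
    result = []
    for row in rows:
        counts = groups[tuple(row[a] for a in lhs)]
        result.append(counts[row[rhs]] / sum(counts.values()))
    return result


# Hypothetical toy table; the third row carries a wrong city for its zip code.
rows = [
    {"name": "Ann", "zip": "10001", "city": "New York"},
    {"name": "Bob", "zip": "10001", "city": "New York"},
    {"name": "Cid", "zip": "10001", "city": "Newark"},
    {"name": "Dee", "zip": "60601", "city": "Chicago"},
]

zip_city = statistical_fd_confidence(rows, ["zip"], "city")
name_city = statistical_fd_confidence(rows, ["name"], "city")
per_row = conditional_value_confidence(rows, ["zip"], "city")
```

On this table the meaningful dependency zip -> city is penalized by the single erroneous row, while name -> city scores perfectly only because every name is unique. That is the kind of approximately holding but invalid FD that a purely statistical measure can rank highly. The per-row values from `conditional_value_confidence` give the erroneous row the lowest score within its zip group.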
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
Peking University Core Journal
2017, No. 1, pp. 207-222 (16 pages)
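Funding
National Basic Research Program of China (973 Program) (2012CB316203)
National Natural Science Foundation of China (61332006, 61472321)
Basic Research Foundation of Northwestern Polytechnical University (3102014JSJ0013, 3102014JSJ0005)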
Keywords
data quality
functional dependency
data confidence
conditional probability