规则与概率相结合的不一致数据子集修复方法 (cited by: 1)

Subset Repair Method Combining Rules and Probabilities for Inconsistent Data
Abstract: Subset repair for inconsistent data is an important research problem in data cleaning. Most existing methods are based on integrity constraint rules and repair the data by deleting the minimum number of tuples. However, these methods do not consider the quality of the deleted tuples, which leads to low repair accuracy. To address this, this study proposes a subset repair method that combines rules and probabilities. The probabilities of inconsistent tuples are modeled so that the average probability of correct tuples is greater than that of erroneous tuples, and the subset repair that minimizes the sum of the probabilities of the deleted tuples is then computed. In addition, to reduce the time overhead of computing these probabilities, the study proposes an efficient error detection method that reduces the number of inconsistent tuples. Experimental results on real and synthetic data verify that the proposed method is more accurate than the state-of-the-art subset repair method.
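Authors: ZHANG An-Zhen (张安珍); SI Jia-Yu (司佳宇); LIANG Tian-Yu (梁天宇); ZHU Rui (朱睿); QIU Tao (邱涛) (School of Computer Science, Shenyang Aerospace University, Shenyang 110136, China)
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2024, Issue 9, pp. 4448-4468 (21 pages)
Funding: National Natural Science Foundation of China, Youth Fund (62102271, 62002245); Basic Research Project of the Education Department of Liaoning Province (JYT2020027).
Keywords: inconsistent data; functional dependency; subset repair; probabilistic graph network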
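
The objective stated in the abstract, deleting the set of tuples whose probabilities sum to the minimum while restoring consistency, can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not describe their probability model or solver, so the code below assumes that each tuple already carries a probability of being correct, that the only rule is a single functional dependency, and that the input is small enough for brute-force search. The names `conflicts`, `min_probability_repair`, `rows`, and `probs` are hypothetical.

```python
from itertools import combinations


def conflicts(rows, lhs, rhs):
    """Return index pairs of rows that violate the functional dependency lhs -> rhs."""
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
        same_rhs = all(rows[i][a] == rows[j][a] for a in rhs)
        if same_lhs and not same_rhs:
            pairs.append((i, j))
    return pairs


def min_probability_repair(rows, probs, lhs, rhs):
    """Brute-force search for the deletion set with the smallest probability sum.

    probs[i] is ASSUMED to be the probability that row i is correct.
    A deletion set is valid if it removes at least one row of every
    conflicting pair. Exponential time: for illustration only.
    """
    edges = conflicts(rows, lhs, rhs)
    involved = sorted({i for edge in edges for i in edge})
    best_cost, best_deleted = float("inf"), set()
    for k in range(len(involved) + 1):
        for subset in combinations(involved, k):
            deleted = set(subset)
            if all(i in deleted or j in deleted for i, j in edges):
                cost = sum(probs[i] for i in deleted)
                if cost < best_cost:
                    best_cost, best_deleted = cost, deleted
    return best_deleted, best_cost


# Toy data (made up for illustration): the dependency is zip -> city.
rows = [
    {"zip": "110136", "city": "Shenyang"},
    {"zip": "110136", "city": "Shenyan"},
    {"zip": "110136", "city": "Shenyan"},
    {"zip": "100000", "city": "Beijing"},
]
probs = [0.9, 0.3, 0.2, 0.7]

deleted, cost = min_probability_repair(rows, probs, ["zip"], ["city"])
print(sorted(deleted), cost)
```

In this toy input, a minimum-cardinality repair would delete row 0 alone, because it conflicts with both rows 1 and 2. The probability-weighted objective instead deletes rows 1 and 2, since their combined probability of being correct is lower than that of row 0. This is the kind of case the abstract points to when it says that counting deleted tuples ignores their quality.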