
Maximum probability subset repair algorithm for inconsistent data
Abstract: Existing subset repair methods for inconsistency errors in relational data usually minimize the number of deleted tuples when searching for the optimal repair, so as to limit changes to the original data. When the data contain many errors, however, the accuracy of this approach declines. This paper proposes a maximum probability subset repair method that uses the correlations between attributes, together with probabilistic and statistical information, to model the probability that each tuple is correct. The optimal subset repair is then the one that minimizes the sum of the correctness probabilities of the deleted tuples, and an efficient approximation algorithm for maximum probability subset repair is given. Experimental results on real and synthetic datasets show that the method achieves higher accuracy than the current state-of-the-art method.
Authors: XIA Xiu-feng (夏秀峰); SI Jiayu (司佳宇); ZHANG An-zhen (张安珍) (College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China)
Source: Journal of Shenyang Aerospace University (《沈阳航空航天大学学报》), 2023, No. 1, pp. 48-57 (10 pages)
Funding: National Natural Science Foundation of China (Grant No. 62102271).
Keywords: inconsistent data; maximum probability; subset repair; data cleaning; machine learning
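
The optimization objective described in the abstract (delete a set of tuples that resolves every conflict while minimizing the summed correctness probability of the deleted tuples) can be illustrated with a small Python sketch. This is not the paper's algorithm: it assumes a single functional dependency `zip -> city`, takes the correctness probabilities as given (the paper derives them from attribute correlations and statistics; the values below are made up), and swaps in the generic local-ratio 2-approximation for minimum-weight vertex cover on the conflict graph. The names `find_conflicts` and `subset_repair` are hypothetical.

```python
from itertools import combinations


def find_conflicts(rows, lhs, rhs):
    """Return index pairs (i, j) that violate the functional dependency lhs -> rhs."""
    conflicts = []
    for i, j in combinations(range(len(rows)), 2):
        if rows[i][lhs] == rows[j][lhs] and rows[i][rhs] != rows[j][rhs]:
            conflicts.append((i, j))
    return conflicts


def subset_repair(probs, conflicts):
    """Choose tuples to delete so that no conflicting pair remains.

    Local-ratio 2-approximation for minimum-weight vertex cover, with the
    (assumed given) correctness probability of each tuple as its weight.
    """
    residual = list(probs)
    deleted = set()
    for i, j in conflicts:
        if i in deleted or j in deleted:
            continue  # this conflict is already resolved
        delta = min(residual[i], residual[j])
        residual[i] -= delta
        residual[j] -= delta
        # At least one endpoint now has residual exactly 0; delete that one.
        if residual[i] == 0.0:
            deleted.add(i)
        else:
            deleted.add(j)
    return deleted


# Toy data: tuple 2 contains a typo and conflicts with tuples 0 and 1.
rows = [
    {"zip": "110136", "city": "Shenyang"},
    {"zip": "110136", "city": "Shenyang"},
    {"zip": "110136", "city": "Shenyan"},
    {"zip": "100000", "city": "Beijing"},
]
probs = [0.9, 0.8, 0.2, 0.95]  # hypothetical correctness probabilities

conflicts = find_conflicts(rows, "zip", "city")
deleted = subset_repair(probs, conflicts)
repaired = [row for k, row in enumerate(rows) if k not in deleted]
```

In this toy case the low-probability tuple is the only one deleted. With weights all equal to 1, the same objective reduces to the minimum-deletion criterion that the abstract attributes to existing methods.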
