An Interactive Iterative Data Cleaning Method Based on Certainty
Abstract: Automated data cleaning greatly improves cleaning efficiency, but it introduces errors and yields results of uncertain reliability. Having a person check the suggested repairs avoids incorrect modifications and gives a direct assessment of how reliable the final result is. Motivated by this, this paper proposes a certainty-based interactive iterative cleaning method. Using active learning, the method combines statistical data cleaning with human participation, improving both the cleaning model's capability and the data quality over successive iterations while minimizing human involvement. Specifically, the method includes a certainty-based automatic cleaning model that measures whether a data value needs to be modified, which effectively reduces erroneous repairs. The paper also defines a certainty gain, which expresses the degree of disagreement between retaining and modifying a value; the suggested repairs with the greatest disagreement are handed to a person for review, which keeps human involvement small. The method's effectiveness is verified on several experimental datasets.
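The abstract does not give the formulas for certainty or certainty gain, so the following is only a rough sketch of the selection step under stated assumptions, not the authors' method. It assumes certainty is the relative frequency of a value among rows sharing the same key, that the suggested repair is the group's most frequent value, and that "greatest disagreement" corresponds to the smallest gain. The function names, the `zip`/`city` columns, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

def suggest_repairs(rows, key, attr):
    """Return (row_index, current, suggested, gain) for each cell whose value
    differs from the most frequent value in its key group.
    ASSUMPTION: certainty = relative frequency within the group;
    gain = certainty(suggested) - certainty(current)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[key]].append(i)
    suggestions = []
    for idxs in groups.values():
        counts = Counter(rows[i][attr] for i in idxs)
        total = len(idxs)
        best, best_n = counts.most_common(1)[0]
        for i in idxs:
            cur = rows[i][attr]
            if cur != best:
                gain = (best_n - counts[cur]) / total
                suggestions.append((i, cur, best, gain))
    return suggestions

def pick_for_review(suggestions, budget):
    """ASSUMPTION: the smallest gain means keep vs. modify is least clear,
    so those suggestions go to the human reviewer first."""
    return sorted(suggestions, key=lambda s: s[3])[:budget]

# Hypothetical sample data for illustration only.
rows = [
    {"zip": "201620", "city": "Shanghai"},
    {"zip": "201620", "city": "Shanghai"},
    {"zip": "201620", "city": "Shanghai"},
    {"zip": "201620", "city": "Shanghi"},
    {"zip": "100000", "city": "Beijing"},
    {"zip": "100000", "city": "Beijing"},
    {"zip": "100000", "city": "Peking"},
]

suggestions = suggest_repairs(rows, "zip", "city")
to_review = pick_for_review(suggestions, budget=1)
```

In an iterative setting, the reviewer's answers would be written back to the data and the suggestions recomputed on the next pass; how the paper updates its model between iterations is not stated in the abstract.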
Authors: SUN Cihai (孙辞海); WANG Hongya (王洪亚); GUO Kaiyan (郭开彦); CHENG Weidong (程炜东) (College of Computer Science and Technology, Donghua University, Shanghai 201620, China; School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China)
Source: Intelligent Computer and Applications (《智能计算机与应用》), 2023, No. 8, pp. 1-10 (10 pages)
Funding: National Natural Science Foundation of China (61370205); Natural Science Foundation of Shanghai (13ZR1400800).
Keywords: data cleaning; active learning; certainty; interactive iteration