Distribution consistency-based missing value imputation algorithm for large-scale data sets

Cited by: 3


Abstract: Missing value imputation (MVI) is an important research branch of data mining that aims to provide high-quality data for training machine learning algorithms. Unlike existing MVI algorithms, which are oriented toward improving algorithm performance, this paper proposes a distribution consistency-based MVI (DC-MVI) algorithm that is oriented toward restoring the data structure, so that missing values in large-scale data can be imputed effectively. First, DC-MVI constructs an objective function for determining the optimal imputation values based on the principle of probability distribution consistency. Second, it applies a derived, feasible missing value optimization rule to obtain imputation values that are maximally distribution-consistent with the original complete values and closest to them in variance. Finally, in a distributed environment, DC-MVI is trained in parallel on random sample partition (RSP) data blocks of the big data set to obtain the imputation values for the missing values of the large-scale data. The experimental results show that DC-MVI not only generates imputation values whose probability distribution is consistent with that of the original complete values at a given significance level, but also imputes faster and more accurately than five classical and three recent MVI algorithms, which confirms that DC-MVI is a feasible MVI algorithm for large-scale data.

[Objective] As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms. However, MVI results for large-scale data sets are not ideal in terms of restoring the data distribution and improving prediction accuracy. To improve on the performance of existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values of large-scale data sets.

[Methods] First, the DC-MVI algorithm constructs an objective function to determine the optimal imputation values based on the principle of probability distribution consistency. Second, the data set is preprocessed by randomly initializing the missing values and normalizing the data, and a feasible missing value update rule is derived to obtain the imputation values with the closest variance to, and the greatest consistency with, the complete original values. Next, in a distributed environment, the large-scale data set is divided into multiple random sample partition (RSP) data blocks that have the same distribution as the entire data set, taking into account the statistical properties of the large-scale data set. Finally, the DC-MVI algorithm is trained in parallel on these blocks to obtain the imputation values for the missing values of the large-scale data set while preserving distribution consistency with the non-missing values. The rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency. In addition, the effectiveness experiments compare DC-MVI with eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) using three indicators: distribution consistency, time complexity, and classification accuracy.

[Results] The experimental results on seven selected large-scale data sets showed that: 1) the objective function of the DC-MVI method was effective and the missing value update rule was feasible, allowing the imputation values to remain stable throughout the adjustment process; 2) the DC-MVI algorithm obtained the smallest maximum mean discrepancy and Jensen-Shannon divergence on all data sets, showing that its imputation values had a probability distribution more consistent with that of the complete original values at the given significance level; 3) the running time of the DC-MVI algorithm tended to be stable in the time comparison experiment, whereas the running time of the other state-of-the-art MVI methods increased linearly with data volume; 4) the DC-MVI approach produced imputation values that were more consistent with the original data set than those of existing methods, which is beneficial for subsequent data mining analysis.

[Conclusions] Considering the peculiarities and limitations of large-scale data with missing values, this paper incorporates RSP into the imputation algorithm and derives the update rule for the imputation values in order to restore the data distribution. The experiments further confirm the effectiveness and practical performance of DC-MVI for large-scale data set imputation, in particular in preserving distribution consistency and increasing imputation quality. The method proposed in this paper achieves the desired result and represents a viable solution to the problem of large-scale data imputation.
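The abstract names maximum mean discrepancy (MMD) as one of the distribution consistency indicators and random sample partition (RSP) blocks as the unit of parallel training, but it does not give the DC-MVI objective function or its update rule. The Python sketch below therefore illustrates only those two supporting ideas, under stated assumptions: `mmd2` is a standard biased MMD estimate with a Gaussian kernel (the kernel and the `gamma` value are assumptions, not taken from the paper), and `rsp_blocks` approximates RSP by shuffling rows and splitting them evenly, which may be simpler than the authors' partitioning procedure. All function names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Pairwise Gaussian kernel exp(-gamma * ||a_i - b_j||^2) between rows."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def rsp_blocks(data, n_blocks, seed=0):
    """Assumed RSP stand-in: shuffle the rows, then split into n_blocks parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return [data[part] for part in np.array_split(idx, n_blocks)]
```

A smaller `mmd2` value between imputed and complete values indicates closer distributions in this sense; each block returned by `rsp_blocks` could then be imputed independently, for example by separate workers.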

Authors: YU Jiayin (余嘉茵); HE Yulin (何玉林); CUI Laizhong (崔来中); HUANG Zhexue (黄哲学) (Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, China)

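Affiliations: College of Computer Science & Software Engineering, Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)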

Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报（自然科学版）》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2023, Issue 5, pp. 740-753 (14 pages)

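Funding: National Natural Science Foundation of China, General Program (61972261); Natural Science Foundation of Guangdong Province, General Program (2314050006683); Shenzhen Basic Research Key Project (JCYJ20220818100205012); Shenzhen Basic Research General Project (JCYJ20210324093609026).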

Keywords: word information processing; missing value imputation; distribution consistency; maximum mean discrepancy; large-scale data; random sample partition; distributed computing

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]


