Abstract
As data sources multiply, data has become easier to acquire, but its quality is hard to guarantee. Missing values (MVs) are therefore ubiquitous and hard to avoid in real-world datasets, and MV imputation has become one of the classical problems in data quality management. Most existing MV imputation approaches were designed for static data and cannot handle dynamic data streams that arrive at high speed; moreover, most of them do not consider data sparsity and heterogeneity at the same time. To address this, the paper proposes a novel online MV imputation approach, real-time imputation based on individual models (RIIM). RIIM accounts for both the sparsity and the heterogeneity of data, and fills MVs by combining the basic ideas of neighbor-based imputation and regression-based imputation. First, to cope with the dynamic, real-time nature of data streams, an efficient algorithm is proposed for incrementally updating the imputation model. Second, because neighbor search is time-consuming and the appropriate number of neighbors is hard to determine, an adaptive, periodic updating strategy for the optimal neighbors is proposed. Finally, extensive experiments on real-world datasets verify the effectiveness of RIIM.
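The idea of combining neighbor-based and regression-based imputation can be illustrated with a small sketch. This is not the RIIM algorithm from the paper: the class `LocalRegressionImputer`, its methods, the fixed neighbor count `k`, and the sliding window are all assumptions made for illustration. The sketch only shows why fitting a regression over the nearest complete tuples, rather than over the whole dataset, can help when the data is heterogeneous.

```python
from collections import deque


class LocalRegressionImputer:
    """Toy stream imputer (hypothetical): k nearest complete tuples + local linear fit."""

    def __init__(self, k=3, window=100):
        self.k = k  # fixed here; the paper adapts the neighbor count instead
        # Sliding window of recent complete (x, y) tuples; the oldest fall out.
        self.complete = deque(maxlen=window)

    def observe(self, x, y):
        """Record a complete tuple arriving on the stream."""
        self.complete.append((x, y))

    def impute(self, x):
        """Estimate the missing y of a tuple whose x is observed."""
        if not self.complete:
            raise ValueError("no complete tuples seen yet")
        # Neighbor step: the k complete tuples closest to x.
        neighbors = sorted(self.complete, key=lambda t: abs(t[0] - x))[: self.k]
        n = len(neighbors)
        mean_x = sum(nx for nx, _ in neighbors) / n
        mean_y = sum(ny for _, ny in neighbors) / n
        sxx = sum((nx - mean_x) ** 2 for nx, _ in neighbors)
        if sxx == 0:
            # Degenerate neighborhood: fall back to the neighbor mean.
            return mean_y
        # Regression step: least-squares line through the neighbors only.
        sxy = sum((nx - mean_x) * (ny - mean_y) for nx, ny in neighbors)
        slope = sxy / sxx
        return mean_y + slope * (x - mean_x)


# Two regimes in one stream: y = 2x + 1 for small x, y = -x for large x.
imputer = LocalRegressionImputer(k=3)
for x in [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]:
    imputer.observe(x, 2 * x + 1 if x < 5 else -x)

low = imputer.impute(2.5)    # fitted from the neighbors at x = 1, 2, 3
high = imputer.impute(11.5)  # fitted from the neighbors at x = 10, 11, 12
```

In this toy stream a single global regression line would fit neither regime well, whereas each local fit follows the relationship that holds around the query point. The sliding window stands in for the incremental model update described in the abstract and is far simpler than what the paper proposes.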
Authors
李霞
马茜
白梅
王习特
李冠宇
宁博
LI Xia; MA Qian; BAI Mei; WANG Xi-te; LI Guan-yu; NING Bo (School of Information Science & Technology, Dalian Maritime University, Dalian, Liaoning 116026, China)
Source
《计算机科学》
CSCD
Peking University Core Journals
2022, Issue 8, pp. 56-63 (8 pages)
Computer Science
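Funding
National Natural Science Foundation of China (62002039, 61602076, 61702072, 61976032)
China Postdoctoral Science Foundation, General Program (2017M611211, 2017M621122, 2019M661077)
Natural Science Foundation of Liaoning Province (20180540003)
CERNET Next Generation Internet Technology Innovation Project (NGII20190902)
Fundamental Research Funds for the Central Universities (3132021239).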
Keywords
Missing value
Real-time imputation
Data streams
Sparsity
Heterogeneity