
An aggregate query algorithm that effectively handles missing data (cited by: 2)

Aggregation query processing algorithm for effective solving data missing problem
Abstract: In recent years, both industry and academia have faced a serious missing-data problem: missing values greatly reduce the usability of data. Existing missing-value imputation techniques incur high time overhead and can hardly meet the real-time requirements of big-data queries. This paper therefore studies how to process aggregate queries efficiently over data with missing values. It combines sampling-based approximate aggregate query processing with missing-value imputation so that aggregate results meeting the user's requirements are returned quickly. A block-level sampling strategy is adopted, missing values are imputed only within the collected sample, and an unbiased estimate of the aggregate result is reconstructed from the imputed sample. Experiments on real and synthetic datasets show that, at the same accuracy, the method greatly improves query efficiency over the current best method.
Authors: SUN Zhou; TIAN Heping; PAN Mingyu; WANG Weixian; ZHANG Lu; CHEN Guang (State Grid Beijing Electric Power Company, Beijing 100075, China; NARI Group, Beijing 102299, China)
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, Peking University Core Journal, 2018, No. 24, pp. 72-78 (7 pages)
Funding: Science and Technology Project of State Grid Corporation of China
Keywords: missing-value imputation; incomplete data; aggregate query; block sampling
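
The abstract's pipeline (sample blocks, impute only inside the sample, scale up to an unbiased aggregate) can be illustrated with a minimal sketch. This is not the paper's algorithm: the record does not say which imputation technique or estimator the authors use. The sketch assumes a SUM query, uniform sampling of blocks without replacement, and a placeholder imputer that fills each gap with the mean of the observed values in the same block. The names `impute_block` and `estimate_sum` are hypothetical.

```python
import random
from typing import List, Optional

# A block is a list of values; None marks a missing value.
Block = List[Optional[float]]


def impute_block(block: Block) -> List[float]:
    """Placeholder imputer (an assumption, not the paper's method):
    fill each missing value with the mean of the block's observed values.
    A block with no observed values is filled with 0.0."""
    observed = [v for v in block if v is not None]
    fill = sum(observed) / len(observed) if observed else 0.0
    return [fill if v is None else v for v in block]


def estimate_sum(blocks: List[Block], m: int, rng: random.Random) -> float:
    """Estimate SUM over the imputed table from m sampled blocks.

    Blocks are drawn uniformly without replacement, so each block is
    included with probability m / N. Scaling the sampled total by N / m
    gives an estimate whose expectation is the SUM of the table in which
    every block has been imputed by impute_block."""
    n_blocks = len(blocks)
    if not 1 <= m <= n_blocks:
        raise ValueError("m must be between 1 and the number of blocks")
    sampled = rng.sample(blocks, m)
    # Imputation cost is paid only for the sampled blocks.
    sample_total = sum(sum(impute_block(b)) for b in sampled)
    return n_blocks / m * sample_total
```

Because the placeholder imputer looks only at the block it fills, each block's imputed total is a fixed quantity, which is what makes the scaled sample total unbiased in this sketch. When `m` equals the number of blocks, the estimate coincides with the exact SUM of the imputed table.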
