
Evaluation of Sequential Data Quality Using Spark (cited by: 1)
Abstract: Sequential data are prevalent in many real-world applications, and evaluating their quality has become an active research problem in both academia and industry. Quality evaluation is also a prerequisite for extracting knowledge from such data. The current mainstream approach evaluates sequential data quality with a probabilistic suffix tree model, but this approach is difficult to apply to large-scale data sets. To address this limitation, the paper proposes STALK (sequential data quality evaluation with Spark), a Spark-based algorithm for evaluating the quality of large-scale sequential data, and adopts improved pruning strategies to increase its efficiency. Specifically, on the Spark platform, a generative model is built efficiently from large-scale sequential data, and the quality of a query sequence is then evaluated rapidly against that model. Experiments on real-world sequential data sets demonstrate the effectiveness, efficiency, and scalability of STALK.
Authors: HAN Chao (韩超); DUAN Lei (段磊); DENG Song (邓松); WANG Huifeng (王慧锋); TANG Changjie (唐常杰). Affiliations: School of Computer Science, Sichuan University, Chengdu 610065, China; West China School of Public Health, Sichuan University, Chengdu 610041, China; Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the Peking University Core Journals list, 2017, No. 6, pp. 897-907 (11 pages)
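Funding: National Natural Science Foundation of China (Nos. 61572332, 51507084); China Postdoctoral Science Foundation (Nos. 2016T90850, 2016M591890); Fundamental Research Funds for the Central Universities (No. 2016SCU04A22)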
Keywords: data quality; probabilistic suffix tree; Spark; parallel computing
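
The abstract describes two stages: building a generative model from a large collection of sequences, then scoring a query sequence against that model. The following is a minimal sketch of that general idea only, not the STALK algorithm from the paper. It assumes a simplified variable-order context model in place of a full probabilistic suffix tree, a plain frequency threshold in place of the paper's pruning strategies, and ordinary Python lists standing in for Spark partitions (per-partition counting followed by a merge mimics a map and reduce step). All names (`count_contexts`, `build_model`, `next_prob`, `quality_score`, `max_order`, `min_count`) are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log


def count_contexts(sequences, max_order):
    """Count (context, next_symbol) pairs within one partition.

    Contexts are the 0..max_order symbols preceding each position.
    """
    counts = Counter()
    for seq in sequences:
        for i, sym in enumerate(seq):
            for k in range(min(max_order, i) + 1):
                counts[(seq[i - k:i], sym)] += 1
    return counts


def build_model(partitions, max_order=2, min_count=2):
    """Merge per-partition counts and prune rare contexts.

    Assumes non-empty training data. The threshold pruning here is an
    illustrative stand-in, not the pruning used in the paper.
    """
    # "Map" each partition to local counts, then "reduce" by summing them.
    local = (count_contexts(p, max_order) for p in partitions)
    total = reduce(lambda a, b: a + b, local, Counter())

    context_totals = Counter()
    for (ctx, _), n in total.items():
        context_totals[ctx] += n

    kept = {ctx for ctx, n in context_totals.items() if n >= min_count}
    kept.add("")  # always keep the empty context as a fallback
    alphabet = {sym for (_, sym) in total}
    return {
        "counts": total,
        "totals": context_totals,
        "kept": kept,
        "alphabet": alphabet,
        "max_order": max_order,
    }


def next_prob(model, history, sym):
    """Estimate P(sym | history) from the longest kept suffix of history."""
    ctx = ""
    for k in range(min(model["max_order"], len(history)), -1, -1):
        candidate = history[len(history) - k:]
        if candidate in model["kept"]:
            ctx = candidate
            break
    # Add-one smoothing so unseen transitions get a small non-zero probability.
    n = model["counts"][(ctx, sym)]
    return (n + 1) / (model["totals"][ctx] + len(model["alphabet"]))


def quality_score(model, query):
    """Average log-likelihood per symbol; higher means closer to the training data."""
    if not query:
        return 0.0
    ll = sum(log(next_prob(model, query[:i], query[i])) for i in range(len(query)))
    return ll / len(query)


# Two lists of strings play the role of two data partitions.
partitions = [["abababab", "ababab"], ["abababab"]]
model = build_model(partitions)
good = quality_score(model, "abab")  # follows the alternating pattern
bad = quality_score(model, "bbaa")   # breaks the pattern
```

In this toy setup `good` is larger than `bad`, because the alternating query matches the transitions seen in training. On an actual Spark cluster, the per-partition counting and merge would typically be expressed with RDD or DataFrame aggregations rather than `functools.reduce`; that mapping is left out here to keep the sketch self-contained.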
