Algorithm for Top-K Keyword Query in Data Streams (流数据Top-K关键字查询算法)

Abstract: Distributed Top-K keyword query on the Spark Streaming framework, which counts the keywords appearing in data streams, is an active research topic. Most existing approaches implement Top-K keyword query with a bounded storage space and assume that the keyword set is known in advance. To address this limitation, we propose a distributed Top-K keyword query algorithm that also works when the keyword set is unknown. The algorithm dynamically adjusts its storage space according to the keywords it monitors and optimizes its update strategy to improve precision. Experimental results show that, when the keyword set is unknown, the proposed algorithm outperforms existing algorithms.
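The bounded-storage approach that the abstract contrasts itself with can be illustrated by the classic Space-Saving scheme of Metwally et al., which appears in the reference list. The sketch below is a minimal single-machine version of that scheme, not the algorithm proposed in this paper; the function name `space_saving_top_k` and its parameters `capacity` and `k` are hypothetical names chosen for illustration.

```python
def space_saving_top_k(stream, capacity, k):
    """Approximate Top-K keywords using at most `capacity` counters.

    Illustrative Space-Saving sketch (an assumption for this example),
    not the dynamic-storage algorithm described in the abstract.
    """
    counts = {}  # keyword -> estimated count
    for word in stream:
        if word in counts:
            # Keyword is already monitored: increment its counter.
            counts[word] += 1
        elif len(counts) < capacity:
            # Free slot available: start monitoring the new keyword.
            counts[word] = 1
        else:
            # Storage is full: evict the keyword with the smallest count.
            # The newcomer inherits that count plus one, so its estimate
            # may exceed its true frequency.
            victim = min(counts, key=counts.get)
            counts[word] = counts.pop(victim) + 1
    # Highest counts first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


stream = ["spark", "flink", "spark", "storm", "spark", "flink"]
print(space_saving_top_k(stream, capacity=3, k=2))
# [('spark', 3), ('flink', 2)]
```

With a fixed `capacity`, a keyword that arrives after the counters are full inherits an inflated count. That loss of precision is what motivates adjusting the storage size to the keywords actually observed.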

Authors: ZHENG Shi-min (郑诗敏), QIN Xiao-lin (秦小麟), LIU Liang (刘亮), ZHOU Qian (周倩) (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)

Affiliation: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 8, pp. 142-147 (6 pages)

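Funding: Supported by the National Natural Science Foundation of China (61373015, 61300052), the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Jiangsu Province Major Scientific and Technological Achievements Transformation Fund (BA2013049)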

Keywords: Top-K keyword query, Data streams, Cloud computing, Spark Streaming

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

References (19)

[1] Chen L, Cong G, Cao X, et al. Temporal spatial-keyword top-k publish/subscribe[C]//2015 IEEE 31st International Conference on Data Engineering. IEEE, 2015: 255-266.
[2] Zheng K, Su H, Zheng B, et al. Interactive top-k spatial keyword queries[C]//2015 IEEE 31st International Conference on Data Engineering. IEEE, 2015: 423-434.
[3] Charikar M, Chen K, Farach-Colton M. Finding Frequent Items in Data Streams[J]. Theoretical Computer Science, 2004, 312(1): 1530-1541.
[4] Metwally A, Agrawal D, Abbadi A E. Efficient Computation of Frequent and Top-k Elements in Data Streams[C]//International Conference on Database Theory. Springer-Verlag, 2005: 398-412.
[5] Zaharia M, Das T, Li H, et al. Discretized streams: fault-tolerant streaming computation at scale[C]//Twenty-Fourth ACM Symposium on Operating Systems Principles. 2013: 423-438.
[6] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Commun. ACM (CACM), 2008, 51(1): 107-113.
[7] Ci X, Ma Y Z, Meng X F. A Top-K query method for big data in cloud environments[J]. Journal of Software, 2014, 25(4): 813-825.
[8] Song J, Hao W N, Chen G, Jin D W, Zhao S N. Research on distributed ETL architecture based on MapReduce[J]. Computer Science, 2013, 40(6): 152-154.
[9] Manku G, Motwani R. Approximate frequency counts over data streams[C]//Proceedings of the 28th International Conference on Very Large Data Bases. 2002: 346-357.
[10] Demaine E D, Lopez-Ortiz A, Munro J I. Frequency estimation of internet packet streams with limited space[C]//Proceedings of the 10th Annual European Symposium on Algorithms. 2002: 348-360.
