Data Source Selection for Web Big Data System

Abstract: How to select an appropriate number of data sources from a large collection of Web data sources, so that as few sources as possible are accessed while the requirements of a given query are still met, is one of the key problems in Web big data system integration. This paper proposes a two-stage data source selection method. In the first stage, data sources highly relevant to the query are selected by the similarity between each source's schema and the middle (mediated) schema, and sources of better quality are then selected by computing the quality of dependent data sources. In the second stage, the repetition rate between data sources is computed based on maximum entropy theory, and an algorithm that selects data sources dynamically under a minimum query cost model is designed and implemented. Finally, the algorithm is evaluated on an experimental platform. The experiments show that the proposed algorithm has high efficiency and scalability compared with other algorithms.
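The two stages summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives no formulas, so the Jaccard similarity, the thresholds, the `Source` fields, and the function names `stage_one` and `stage_two` are all assumptions made here for illustration. In particular, the overlap between sources is estimated under a simple independence assumption as a stand-in for the paper's maximum-entropy estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of one Web data source."""
    name: str
    attributes: frozenset  # attribute names in the source schema
    quality: float         # assumed quality score in [0, 1]
    coverage: float        # assumed fraction of query answers the source holds
    cost: float            # assumed access cost, must be > 0


def jaccard(a, b):
    """Jaccard similarity of two attribute sets (a stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def stage_one(sources, middle_schema, min_sim=0.5, min_quality=0.6):
    """Keep sources that are close to the middle schema and of sufficient quality."""
    return [
        s for s in sources
        if jaccard(s.attributes, middle_schema) >= min_sim
        and s.quality >= min_quality
    ]


def stage_two(candidates, target_coverage=0.9):
    """Greedily add the source with the best marginal coverage per unit cost.

    Overlap is estimated by treating sources as independent: a source with
    coverage c contributes c * (1 - covered) new answers.
    """
    chosen, covered = [], 0.0
    remaining = list(candidates)
    while remaining and covered < target_coverage:
        best = max(remaining, key=lambda s: s.coverage * (1 - covered) / s.cost)
        gain = best.coverage * (1 - covered)
        if gain <= 0:
            break  # no remaining source adds new answers
        chosen.append(best)
        covered += gain
        remaining.remove(best)
    return chosen, covered
```

A caller would first run `stage_one` to filter the full source list against the middle schema, then pass the survivors to `stage_two`, which stops as soon as the estimated coverage reaches the target and so avoids accessing further sources.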

Authors: LIU Zhengtao; WANG Jiandong (College of Computer Science and Engineering, Sanjiang University, Nanjing 210012, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China)

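Affiliations: College of Computer Science and Engineering, Sanjiang University; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics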

Source: Journal of Frontiers of Computer Science and Technology (计算机科学与探索), CSCD, Peking University Core Journal, 2018, No. 3, pp. 360-369 (10 pages)

Funding: National Natural Science Foundation of China, No. 61139002

Keywords: Web big data; data source selection; quality of data source; dependence of data source

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
