实体数据库中多相似连接顺序选择策略 (Cited by: 3)

Multi-Similarity Join Order Selection in Entity Database

Abstract: Organizing and querying tuples by the entities they describe is an effective way to manage poor-quality data. Because the same attribute of one entity may have several description values, a join over an entity-based database is a similarity join that must support multiple values per attribute. Since the join order of a multi-table join strongly affects join performance, this paper studies join order selection for multi-table joins on entity databases. It uses an entity-based Markov chain Monte Carlo (MCMC) method to estimate the result size of similarity joins on an entity database and, taking the join result size and the presence or absence of indexes as the main cost factors, proposes an entity-based multi-join order optimization strategy. Experiments show that the join result size estimation algorithm has a significant advantage on large-scale data.
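The two ingredients named in the abstract, a join result size estimate and a cost-based choice of join order, can be illustrated with a minimal sketch. It is not the paper's algorithm: the estimator below uses plain uniform pair sampling as a stand-in for the entity-based MCMC estimator, and the cost model (sum of intermediate result sizes plus a probe cost that is higher when the joined relation has no index) is an assumption made for illustration. All names (`jaccard`, `entities_match`, `estimate_join_size`, `plan_cost`, `best_order`, `scan_penalty`) are hypothetical.

```python
import random
from itertools import permutations

def jaccard(a, b):
    """Jaccard similarity of two strings viewed as sets of lowercase word tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def entities_match(vals_a, vals_b, threshold):
    """Assumed join predicate: two entities join if ANY pair of their
    description values for the attribute is similar enough."""
    return any(jaccard(x, y) >= threshold for x in vals_a for y in vals_b)

def estimate_join_size(r, s, threshold, samples=2000, seed=0):
    """Estimate |R join S| by uniform pair sampling (a simple stand-in,
    NOT the MCMC estimator of the paper). r and s are lists of entities;
    each entity is a list of description values for the join attribute."""
    rng = random.Random(seed)
    hits = sum(
        entities_match(rng.choice(r), rng.choice(s), threshold)
        for _ in range(samples)
    )
    return hits / samples * len(r) * len(s)

def plan_cost(order, sizes, sel, indexed, scan_penalty=10.0):
    """Hypothetical cost of a left-deep plan: at each step, pay a probe cost
    (more expensive without an index) plus the intermediate result size."""
    inter = float(sizes[order[0]])
    joined, cost = [order[0]], 0.0
    for rel in order[1:]:
        selectivity = 1.0  # no predicate between two relations means cross product
        for j in joined:
            selectivity *= sel.get(frozenset((j, rel)), 1.0)
        probe = inter * (1.0 if indexed[rel] else scan_penalty)
        inter = inter * sizes[rel] * selectivity
        cost += probe + inter
        joined.append(rel)
    return cost

def best_order(sizes, sel, indexed):
    """Exhaustively pick the cheapest left-deep order (fine for a few relations)."""
    order = min(permutations(sizes), key=lambda p: plan_cost(p, sizes, sel, indexed))
    return order, plan_cost(order, sizes, sel, indexed)
```

In this sketch the pairwise selectivities in `sel` would come from `estimate_join_size(r, s, t) / (len(r) * len(s))`. Exhaustive enumeration is only practical for a handful of relations; a real optimizer would use dynamic programming or a greedy heuristic.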

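Authors: Liu Xueli, Wang Hongzhi, Li Jianzhong, Gao Hong

Affiliation: School of Computer Science and Technology, Harbin Institute of Technology

Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, 2012, No. 10, pp. 865-876 (12 pages)

Funding: National Natural Science Foundation of China (Nos. 61003046, 61033015, 61133002); National Basic Research Program of China (973 Program) (No. 2010CB316200); Specialized Research Fund for the Doctoral Program of Higher Education (No. 20102302120054); Fundamental Research Funds for the Central Universities (No. 2013064); RSE-NSFC exchange project (No. 61111130189)

Keywords: multi-join; entity; similarity join; Markov chain Monte Carlo (MCMC)

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]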

Citation network

References (12)

1. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning[C]//Proceedings of the 22nd International Conference on Data Engineering (ICDE '06). Washington, DC, USA: IEEE Computer Society, 2006: 1-5.
2. Dong Xin, Halevy A Y, Yu Cong. Data integration with uncertainty[C]//Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), 2007: 687-698.
3. Ji Shengyue, Li Guoliang, Li Chen, et al. Efficient interactive fuzzy keyword search[C]//Proceedings of the 18th International Conference on World Wide Web (WWW '09). New York, NY, USA: ACM, 2009: 371-380.
4. Hoad T C, Zobel J. Methods for identifying versioned and plagiarized documents[J]. Journal of the American Society for Information Science and Technology, 2003, 54(3): 203-215.
5. Broder A Z, Glassman S C, Manasse M S, et al. Syntactic clustering of the Web[J]. Computer Networks and ISDN Systems, 1997, 29(8): 1157-1166.
6. Lee H, Ng R T. Similarity join size estimation using locality sensitive hashing[J]. Proceedings of the VLDB Endowment, 2011, 4(6): 338-349.
7. Lee H, Ng R T. Power-law based estimation of set similarity join size[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 658-669.
8. Wu Y-L, Agrawal D, El Abbadi A. Query estimation by adaptive sampling[C]//Proceedings of the 18th International Conference on Data Engineering (ICDE '02), 2002: 639-648.
9. Chaudhuri S, Motwani R, Narasayya V. On random sampling over joins[J]. ACM SIGMOD Record, 1999, 28(2): 263-274.
10. Jerrum M, Sinclair A. Approximation algorithms for NP-hard problems[M]. Boston, MA: PWS Publishing Co, 1996: 482-520.

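Co-cited literature (3)

1. Wang Hongzhi, Li Jianzhong, Gao Hong. A data model for dirty databases[J]. Journal of Software, 2012, 23(3): 539-549. Cited by: 11
2. Zhang Yan, Yang Zhongsheng, Wang Hongzhi, Gao Hong, Li Jianzhong. Similarity join result size estimation on poor-quality databases based on compressed histograms[J]. Journal of Chinese Computer Systems, 2012, 33(10): 2113-2120. Cited by: 2
3. Zhang Yan, Yang Long, Wang Hongzhi. Threshold similarity join result size estimation on poor-quality databases[J]. Chinese Journal of Computers, 2012, 35(10): 2159-2168. Cited by: 6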

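Citing literature (3)

1. Zhang Yan, Tang Xing, Wang Hongzhi. Query optimization strategies on poor-quality databases[J]. Journal of Chinese Computer Systems, 2014, 35(11): 2410-2415.
2. Zhang Yan, Tang Xing. A method for obtaining statistics on poor-quality data[J]. Intelligent Computer and Applications, 2014, 4(5): 26-28.
3. Xue-Li Liu, Hong-Zhi Wang, Jian-Zhong Li, Hong Gao. EntityManager: Managing Dirty Data Based on Entity Resolution[J]. Journal of Computer Science & Technology, 2017, 32(3): 644-662. Cited by: 2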

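Second-level citing literature (2)

1. Bo-Han Li, Yi Liu, An-Man Zhang, Wen-Huan Wang, Shuo Wan. A Survey on Blocking Technology of Entity Resolution[J]. Journal of Computer Science & Technology, 2020, 35(4): 769-793. Cited by: 1
2. Gao Guangshang. Incremental grouping based on similarity propagation in entity resolution[J]. Systems Engineering - Theory & Practice, 2019, 39(5): 1287-1297. Cited by: 1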

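Related literature

1. Liu Xueli, Wang Hongzhi, Li Jianzhong, Gao Hong. Entity-based similarity join algorithms[J]. Journal of Software, 2015, 26(6): 1421-1437. Cited by: 8
2. Hua Lie. Counting the number and types of blocks in a drawing by accessing the entity database[J]. Computer Programming Skills & Maintenance, 2001(12): 55-57.
3. A Duo. Flexibly adjusting Internet access: arranging the network connection order[J]. 电脑迷, 2010(8): 15-15.
4. Li Guijie, Mei Hong. Optimization of join order in multi-relation SQL queries[J]. Journal of Hangzhou Dianzi University (Natural Science Edition), 2006, 26(2): 31-34. Cited by: 4
5. Sun Decai, Wang Xiaoxia. A MapReduce-based similarity self-join algorithm for large datasets[J]. Computer Science, 2017, 44(5): 20-25. Cited by: 3
6. Liu Yan, Hao Zhongxiao. A Δ-tree-based similarity join algorithm for high-dimensional data[J]. Computer Science, 2011, 38(10): 157-160. Cited by: 1
7. 风花雪月. Taking charge of the network connection order: how to prefer the wired connection when both wireless and wired networks are connected[J]. 电脑迷, 2011(5): 69-69.
8. Xu Yuanyuan, Chen Huahui. MapReduce-based similarity join on incremental datasets[J]. Application Research of Computers, 2014, 31(11): 3369-3374. Cited by: 2
9. Xia Junying, Xu Xiaoquan, Xiong Jiulong. Fast extraction of straight-line edge features using gradient information[J]. Journal of Image and Graphics, 2012, 17(8): 987-994. Cited by: 8
10. Liu Yan, Hao Zhongxiao. Self-similarity join processing for high-dimensional data based on a main-memory Δ-tree[J]. Journal of Computer Research and Development, 2009, 46(6): 995-1002. Cited by: 4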
