Memory Similarity Join Algorithm Based on Index


Abstract: In traditional similarity join algorithms, the data partitioning phase and the refinement (exact computation) phase are independent of each other. During refinement, every pair of data items within a partition must be compared, which incurs a large number of comparisons. To address this problem, this paper designs a new in-memory index, the DistanceTree, and proposes an in-memory similarity join algorithm built on it. The algorithm distributes data into partitions according to the underlying distribution of the data, ensuring that pairs of items with a given degree of similarity are assigned to the same or adjacent partitions, and it preserves the results computed during partitioning in the positional information of the tree nodes. By reusing these results, the refinement phase only needs to compare data in the same or adjacent leaf nodes. Experimental results show that, compared with the TOUCH algorithm, the DistanceTree-based algorithm runs 2 to 3 times faster and scales better.
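The partition-then-refine idea summarized in the abstract can be illustrated with a much simpler stand-in than the paper's DistanceTree, whose internal structure is not described on this page. The sketch below is an assumption-laden example, not the authors' method: it uses a uniform grid with cell width `eps` (a hypothetical parameter name for the distance threshold), which guarantees that any two points within `eps` of each other fall in the same cell or in neighbouring cells, so the refinement step only compares those cells. The function name `grid_similarity_join` is likewise invented for illustration.

```python
from collections import defaultdict
from itertools import product
import math


def grid_similarity_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Illustrative epsilon-grid join; this is NOT the DistanceTree from the paper.
    Assumes eps > 0 and that all points have the same dimensionality.
    """
    if not points:
        return []

    # Partition phase: bucket each point by its grid cell. With cell width eps,
    # two points within eps of each other differ by at most 1 in every cell
    # coordinate, so they share a cell or sit in neighbouring cells.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / eps) for c in p)
        cells[key].append(idx)

    # All neighbour offsets, including (0, ..., 0) for the cell itself.
    dim = len(points[0])
    offsets = list(product((-1, 0, 1), repeat=dim))

    # Refinement phase: compare only points in the same or neighbouring cells.
    result = []
    for key, members in cells.items():
        for off in offsets:
            neighbour = tuple(k + o for k, o in zip(key, off))
            if neighbour not in cells:
                continue
            for i in members:
                for j in cells[neighbour]:
                    # i < j reports each unordered pair exactly once.
                    if i < j and math.dist(points[i], points[j]) <= eps:
                        result.append((i, j))
    return sorted(result)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (0.9, 0.0)]
    print(grid_similarity_join(sample, 1.0))
```

The number of neighbouring cells grows as 3^d with the dimensionality d, so a grid like this is only practical in low dimensions; the abstract indicates that the paper instead partitions according to the data distribution, which this sketch does not attempt to reproduce.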

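Authors: DONG Mingxiu (董明秀), WANG Peng (王鹏), WANG Yang (汪洋), LI Qiuhong (李秋虹), WANG Wei (汪卫)

Affiliations: School of Computer Science, Fudan University; Shanghai Key Laboratory of Data Science, Fudan University

Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2016, Issue 1, pp. 18-24, 30 (8 pages in total)

Funding: National Natural Science Foundation of China (61103009); Shanghai Science and Technology Commission Big Data Special Fund (13511504800)

Keywords: similarity join; disk query; memory index; partition

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]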


References (15)

[1] Ubell M. The Montage Extensible DataBlade Architecture[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. Minneapolis, USA: ACM Press, 1994: 482-493.
[2] Wang Fusheng. A Data Model and Database for High-resolution Pathology Analytical Image Informatics[J]. Journal of Pathology Informatics, 2011, 2(1): 32-40.
[3] Henzinger M R. Finding Near-duplicate Web Pages: A Large-scale Evaluation of Algorithms[C]//Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, USA: ACM Press, 2006: 284-291.
[4] Hoad T C. Methods for Identifying Versioned and Plagiarized Documents[J]. Journal of the American Society for Information Science and Technology, 2003, 54(3): 203-215.
[5] Nobari S, Tauheed F, Heinis T. TOUCH: In-memory Spatial Join by Hierarchical Data-oriented Partitioning[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 2013: 701-712.
[6] Patel J M, DeWitt D J. Partition Based Spatial-merge Join[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 1996: 259-270.
[7] Ye Wang, Metwally A, Parthasarathy S. Scalable All-pairs Similarity Search in Metric Spaces[C]//Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM Press, 2013: 829-837.
[8] Guttman A. R-trees: A Dynamic Index Structure for Spatial Searching[C]//Proceedings of ACM SIGMOD International Conference on Management of Data. New York, USA: ACM Press, 1984: 47-57.
[9] Bryant V. Metric Spaces: Iteration and Application[M]. London, UK: Cambridge University Press, 1985.
[10] Toussaint G T. A Simple Linear Algorithm for Intersecting Convex Polygons[J]. The Visual Computer, 1985, 1(2): 118-123.

