期刊文献+

基于卡方分布的高维数据相似性连接查询算法 被引量:2

Chi-square distribution based similarity join query algorithm on high-dimensional data
下载PDF
导出
摘要 为了解决高维数据相似性连接查询中存在的维度灾难和计算代价高等问题,基于p-稳态分布,将高维数据映射到低维空间。根据卡方分布的性质,证明了如果低维空间的距离大于kε,则原始空间距离大于ε的概率具有一定的下界,从而可以在低维空间以较低的计算代价进行有效过滤。在此基础上,提出了基于卡方分布的高维数据相似性连接查询算法。为了进一步提高查询效率,提出了基于双重过滤的高维数据相似性连接查询算法。利用真实数据集进行了实验,实验结果表明所提方法具有较好的性能。基于卡方分布的相似性连接查询算法召回率可以达到90%以上。基于双重过滤的相似性连接查询算法可以进一步提高性能,但是会损失一定的召回率。对时间性能要求比较高、对召回率要求不太严格的查询任务可以采用基于双重过滤的相似性连接查询算法;反之,可以采用基于卡方分布的相似性连接查询算法。 To deal with the curse of dimensionality and costly computation problems existed in high-dimensional similarity join query, the high-dimensional data were mapped to low-dimensional space based on p-stable distribution. According the definition of chi-square distribution, a theorem was proved: if the distance of two points in low-dimensional space is greater than kε, the probability that the distance of two points in original space is greater than ε has a lower bound. So the effective filtering can be performed at relative low cost in the mapped space. A novel chi-square distribution-based similarity join query algorithm on high-dimensional data was proposed. In order to further improve the query efficiency, another similarity join query algorithm based on double filtering was also proposed. Comprehensive experiments were performed. The experimental results show that the proposed approaches have good performance. The recall of the chi-square distribution-based similarity join query algorithm is larger than 90%. The double filtering based similarity join query algorithm can further improve the efficiency, but it will lose some recall rate. Chi-square distribution based similarity join query algorithm is suitable for the query tasks which are critical of the query performance but not critical of the recall; otherwise, the similarity join query algorithm based on double filtering is favorable.
出处 《计算机应用》 CSCD 北大核心 2016年第7期1993-1997,2037,共6页 journal of Computer Applications
基金 国家自然科学基金资助项目(61501216 61272015) 河南省科技攻关计划项目(152102210332 152102210331) 中原经济区智慧旅游河南省协同创新中心2015年度开放课题(2015-ZHLV-009)~~
关键词 相似性连接查询 高维数据 卡方分布 p-稳态分布 召回率 similarity join query high-dimensional data chi-square distribution p-stable distribution recall
  • 相关文献

参考文献19

  • 1庞俊,谷峪,许嘉,于戈.相似性连接查询技术研究进展[J].计算机科学与探索,2013,7(1):1-13. 被引量:15
  • 2庞俊,于戈,许嘉,谷峪.基于MapReduce框架的海量数据相似性连接研究进展[J].计算机科学,2015,42(1):1-5. 被引量:16
  • 3SHIM K, SRIKANT R, AGRAWAL R. High-dimensional similarity joins [ J]. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(1): 156-171.
  • 4BOHM C, BRAUNMCdLLER B, KREBS F, et al. Epsilon grid or- der: an algorithm for the similarity join on massive high-dimensional data [ C]// Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2001:379 -388.
  • 5KALASHNIKOV D. Super-EGO: fast multi-dimensional similarity join [J]. The VLDB Journal, 2013, 22(4): 561 -585.
  • 6DEAN J, GHEMAWAT S. MapReduee: simplified data processing on large clusters [ C]// Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation. San Francisco: USENIX Association, 2004:137 - 150.
  • 7SEIDL T, FRIES S, BODEN B. MR-DSJ: distance-based self-join for large-scale vector data analysis with MapReduce [ C]//Proceed- ings of the 15th BTW Conference on Database Systems for Business, Technology, and Web. Berlin: Springer, 2013:37-56.
  • 8FRIES S, BODEN B, STEPIEN G, et al. PHiDJ: parallel similarity self-join for high-dimensional vector data with MapReduce [ C]// Proceedings of the 30th IEEE International Conference on Data Engi- neering. Piscataway, NJ: IEEE, 2014:796-807.
  • 9LUO W, TAN H, MAO H, et al. Efficient similarity joins on mas- sive high-dimensional datasets using MapReduce [ C]// Proceedings of the 13th IEEE International Conference on Mobile Data Manage- ment. Piscataway, NJ: IEEE, 2012: 1-10.
  • 10LU W, SHEN Y, CHEN S, et al. Efficient processing of k nearest neighbor joins using MapReduce [ J]. Proceedings of the VLDB Endowment, 2012, 5(10) : 1016 - 1027.

二级参考文献59

  • 1李国杰.大数据研究的科学价值[J].中国计算机学会通讯,2012,8(9):8-15.
  • 2CIO时代网.浅析大数据的特点[EB/OL]. (2012-05-08). [2013-5-20] .http://www.ciotimes.com /baike/62989.html.
  • 3Brinkhoff T,Kriegel H P,Seeger B.Efficient processing of spatial joins using R-trees. ACM SIGMOD Int. Conf. on Management of Data . 1993
  • 4G. Luo,J. Naughton,C. Ellman.A Non-Blocking Parallel Spatial Join Algorithm. Proc International Conference on Data Engineering . 2002
  • 5Dittrich,J -P,B. Seeger.Data Redundancy and Duplicate Detection in Spatial Join Processing. Proceedings of the 16th International Conference on Data Engineering . 2000
  • 6Tao Y,Papadias D.Range Aggregate Processing in Spatial Databases. IEEE Transactions on Knowledge and Data Engineering . 2004
  • 7Zhang D,Tsotras V J,Gunopulos D.Efficient Aggregation over Objects with Extent. Proceedings of the 21th ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems . 2002
  • 8M Zhu.Top-k spatial joins. IEEE Transactions on Knowledge and Data Engineering . 2005
  • 9Jeffrey Dean,Sanjay Ghemawat.MapReduce:Simplified Data Processing on Large Clusters. OSDI’’04:Sixth Symposium on Operating System Design andImplementation . 2004
  • 10Patel JM,DeWitt DJ.Partition based spatial-merge join. SIGMOD Record . 1996

共引文献31

同被引文献14

引证文献2

二级引证文献10

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部