摘要
近年来,在多媒体信息检索、相似性连接和时问序列匹配等数据库领域的相似搜索研究备受关注。绝大部分工作都是在欧式空间条件下,使用度量距离函数计算最近邻(如kNN、kNNJ)来解决搜索目标集合问题。但已有研究表明,此条件下的搜索结果准确性很容易受到高差异维度的影响,且对应的解决方案尚缺乏灵活性和顽健性。首先提出了单机环境下动态子空间(部分维度)下相似搜索问题及解决方案。随着数据规模的扩大,单机算法不能很好地扩展,随之又提出了Hadoop框架下的分布式算法。实验证实,在不影响准确率的情况下,分布式算法的性能要优于集中式算法。
In recent years, such database fields as multimedia information retrieval, similarity join and time series matching, where similarity search has attracted much attention. Existing researches mostly compute nearest neighbor to solve problems about search target set, such as kNN and kNNJ, by metric distance functions in the Euclidean space. But some studies showed that high dissimilarity dimensions had got great effect on the accuracy of answer and flexibility and robustness still were lacked in corresponding solutions. Thus centralized dynamic subspace or partial dimensions similarity search problem and algorithms were proposed at first. Furthermore, with the emerge of very large dataset, centralized algorithms can~ extend very well. Finally, the distributed ones under hadoop framework were proposed. Experiments prove that distributed algorithms outperform centralized ones in the performance without accuracy loss.
出处
《电信科学》
北大核心
2015年第7期63-74,共12页
Telecommunications Science