Abstract
Nearest neighbor queries are widely used in applications such as collaborative filtering, location-based services, and decision support systems. With the development and spread of techniques such as Web entity extraction, privacy-preserving information transformation, and image recognition, uncertain text data has become ubiquitous in many fields. The TF-IDF algorithm, grounded in information theory, converts textual similarity matching into computation over numerical vectors, and is rigorous and effective. However, the cosine distance over TF-IDF vectors is not a metric, which makes it difficult to build indices on it. Existing methods are efficient either for numerical data or for certain data, but none efficiently supports data that is both uncertain and textual. This paper therefore studies cosine-similarity-based similarity queries over uncertain text data. First, it analyzes the properties of cosine similarity computation on uncertain data and proposes a fast similarity computation method. Second, by transforming the cosine distance computation, it builds an improved index structure, the sMVP-tree (statistic multiple vantage point tree), and gives a cosine-similarity-based similarity computation method for uncertain data. Finally, building on this similarity computation method, it presents kNN and RkNN query algorithms for distributed environments. Extensive experiments on real data show that the proposed algorithms are correct, effective, and efficient.
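The remark that cosine distance is not a metric can be illustrated with a small sketch. The code below is only an illustration under stated assumptions: the three toy vectors and the function names are hypothetical, and the normalized Euclidean distance shown is one common textbook transformation, not necessarily the transformation the paper uses for the sMVP-tree.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity; this does NOT satisfy the triangle inequality."""
    return 1.0 - cosine_similarity(a, b)

def normalized_euclidean(a, b):
    """Euclidean distance between the unit-normalized vectors.

    For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b), so this distance is a
    true metric and ranks neighbors in the same order as cosine similarity.
    (Assumed illustration, not the paper's own transformation.)
    """
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(a, b)))

# Hypothetical toy vectors standing in for TF-IDF weights.
a = [1.0, 0.0]
b = [1.0, 1.0]
c = [0.0, 1.0]

# Triangle inequality check: d(a, c) <= d(a, b) + d(b, c)
violates = cosine_distance(a, c) > cosine_distance(a, b) + cosine_distance(b, c)
holds = normalized_euclidean(a, c) <= (
    normalized_euclidean(a, b) + normalized_euclidean(b, c)
)
print("cosine distance violates triangle inequality:", violates)
print("normalized Euclidean satisfies it:", holds)
```

Because metric-space indices such as vantage point trees prune candidates using the triangle inequality, a transformation of this kind is what makes index-based pruning sound for cosine-based queries.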
Source
《计算机科学与探索》
CSCD
Peking University Core Journal
2018, Issue 1, pp. 49-64 (16 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China, No. 61472070
National Basic Research Program of China (973 Program), No. 2012CB316201