
Methods for Similarity Query on Uncertain Data with Cosine Similarity Constraints (cited by: 12)
Abstract Nearest neighbor queries are widely used in applications such as collaborative filtering, location-based services, and decision support systems. With the spread of techniques such as entity extraction from Web information, privacy-preserving information transformation, and text recognition in images, uncertain text data has become ubiquitous in many fields. The TF-IDF algorithm, grounded in information theory, turns textual similarity matching into similarity computation over numerical vectors, and is both rigorous and effective. However, the cosine distance over TF-IDF vectors is not a metric, which makes it difficult to build indices on it. Existing methods are efficient either for numerical data or for certain data, but none efficiently supports data that is both uncertain and textual. This paper therefore studies cosine-similarity-based similarity queries on uncertain text data. First, it analyzes the properties of cosine similarity computation under uncertainty and proposes a fast similarity computation method. Second, by transforming the cosine distance computation, it builds an improved metric-space index structure, the sMVP-tree (statistic multiple vantage point tree), and gives a cosine-similarity-based similarity computation method for uncertain data. Finally, building on this method, it presents kNN and RkNN query algorithms for a distributed environment. Extensive experiments on real data verify the correctness and effectiveness of the proposed algorithms.
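The abstract's point that cosine distance is not a metric can be checked on a small example. The sketch below is only an illustration and is not the paper's method: the abstract does not say which transformation the authors apply, so the angular distance (the arccosine of the cosine similarity) is used here as one well-known conversion that does satisfy the triangle inequality. The function names and the three sample vectors are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 - cosine similarity; not a metric in general."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle in radians between u and v (assumed transformation, not the paper's)."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(s)

# Hypothetical sample vectors: b lies halfway between a and c.
a = (1.0, 0.0)
b = (1.0, 1.0)
c = (0.0, 1.0)

# Triangle inequality check: d(a, c) <= d(a, b) + d(b, c)
cosine_violates = cosine_distance(a, c) > cosine_distance(a, b) + cosine_distance(b, c)
angular_holds = angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c) + 1e-12

print("cosine distance violates triangle inequality:", cosine_violates)
print("angular distance satisfies triangle inequality:", angular_holds)
```

Metric indexes such as vantage-point trees prune candidates using the triangle inequality, which is why a distance that violates it cannot be indexed directly in this way.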
Source Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, Peking University Core Journal, 2018, No. 1, pp. 49-64 (16 pages)
Funding National Natural Science Foundation of China, No. 61472070; National Basic Research Program of China (973 Program), No. 2012CB316201
Keywords uncertain data; distributed algorithm; cosine similarity; similarity query
