Abstract
This paper proposes a query algorithm for massive text data based on locality-sensitive hashing (LSH). LSH maps the feature vectors of texts into hash buckets, which reduces computational complexity and improves retrieval efficiency. First, each text is represented by a TF-IDF feature vector, and this vector is mapped to hash buckets according to a given set of hash functions. Next, the hash table is used to compute a histogram for each text, and the similarity between two texts is computed from the distance between their histograms. Finally, the texts in the target collection are ranked by their similarity to the query text, and the highest-ranked texts are returned to the user as relevant results. Experimental results show that, compared with existing methods, the proposed algorithm achieves better performance on two measures: MAP and the precision-recall curve.
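The three steps in the abstract (TF-IDF vectors, hashing into buckets, ranking by histogram distance) can be illustrated with a minimal Python sketch. The abstract does not specify the hash family, how the histogram is built, or which histogram distance is used, so everything below is an assumption made for illustration: sign-of-random-projection hash functions, a histogram that counts (table, bucket) cells, an L1 histogram distance, and a score of 1 minus the normalized distance. All function names are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def fit_idf(corpus):
    """Inverse document frequency for every term in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocab):
    """Dense TF-IDF vector over a fixed vocabulary; unseen terms are ignored."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[term] / total * idf[term] for term in vocab]


def make_hash_functions(dim, n_tables, n_bits, seed=0):
    """Assumed hash family: each function is n_bits random hyperplanes."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)]


def bucket_histogram(vec, hash_functions):
    """Count the (table, bucket) cells that the vector is hashed into."""
    hist = Counter()
    for table, planes in enumerate(hash_functions):
        # The bucket id is the sign pattern of the projections.
        bucket = tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes
        )
        hist[(table, bucket)] += 1
    return hist


def histogram_distance(h1, h2):
    """L1 distance between two sparse histograms (an assumed choice)."""
    return sum(abs(h1[key] - h2[key]) for key in set(h1) | set(h2))


def rank_texts(query_tokens, corpus, n_tables=16, n_bits=4, seed=0):
    """Return (document index, score) pairs, highest score first."""
    idf = fit_idf(corpus)
    vocab = sorted(idf)
    hash_functions = make_hash_functions(len(vocab), n_tables, n_bits, seed)
    query_hist = bucket_histogram(
        tfidf_vector(query_tokens, idf, vocab), hash_functions
    )
    scored = []
    for index, doc in enumerate(corpus):
        doc_hist = bucket_histogram(tfidf_vector(doc, idf, vocab), hash_functions)
        distance = histogram_distance(query_hist, doc_hist)
        # Each histogram sums to n_tables, so the L1 distance is at most 2 * n_tables.
        scored.append((index, 1.0 - distance / (2.0 * n_tables)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_texts(query_tokens, corpus)` with a list of tokenized documents returns the documents ordered by score. In this sketch, increasing `n_bits` makes each bucket more selective, while increasing `n_tables` gives the histogram more cells to compare; the settings used in the paper are not stated in the abstract.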
Source
Bulletin of Science and Technology (《科技通报》)
Peking University Core Journal (北大核心)
2013, Issue 10, pp. 70-72 (3 pages)
Funding
2013 Science and Technology Research (General) Project of the Heilongjiang Provincial Department of Education (12531089)
Keywords
locality sensitive hashing
massive text data
hash bucket
ranking