Abstract
This paper proposes a latent semantic document retrieval algorithm based on pre-clustering. First, the document collection to be searched is pre-clustered: k-means clustering is applied on top of latent semantic analysis to find the center of each cluster. Second, at retrieval time, the similarity between the query vector and each cluster center is computed to carry out the search. This addresses a shortcoming of existing latent semantic document retrieval algorithms, which spend a large amount of time computing the similarity between the query vector and every text vector. In addition, a revised feature-weighting method is given to suit the characteristics of document retrieval. Experimental results show that the method shortens retrieval time and improves retrieval efficiency.
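The two-stage search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the documents have already been projected into the latent semantic space (the SVD step is omitted), uses toy 2-dimensional vectors, and assigns documents to clusters by cosine similarity. The abstract does not say how documents are returned once the closest cluster center is found, so ranking only that cluster's members is a further assumption. The names `kmeans`, `search`, and `docs` are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iters=20, seed=0):
    """Lloyd-style k-means; assignment by cosine similarity (an assumption).

    Initial centers are k documents drawn with a fixed seed so the
    sketch is reproducible. Returns (centroids, labels).
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old center if a cluster empties
                centroids[c] = mean(members)
    return centroids, labels


def search(query, vectors, centroids, labels, top_n=3):
    """Pick the closest cluster center, then rank only that cluster's documents."""
    best = max(range(len(centroids)), key=lambda c: cosine(query, centroids[c]))
    candidates = [i for i, l in enumerate(labels) if l == best]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return best, candidates[:top_n]


# Toy latent-space document vectors: two well-separated groups (made-up data).
docs = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
        [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]

# Offline step: pre-cluster once. Online step: compare the query to 2 centers.
centroids, labels = kmeans(docs, k=2)
best, hits = search([0.7, 0.3], docs, centroids, labels)
print(hits)  # prints [1, 0, 2]
```

With k clusters over n documents, the query is first compared against k centers rather than all n text vectors, which is where the time saving claimed in the abstract would come from. The trade-off, under the assumptions above, is that relevant documents assigned to other clusters are not examined.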
Source
《云南民族大学学报(自然科学版)》
CAS
2015, Issue 3, pp. 257-260 (4 pages)
Journal of Yunnan Minzu University: Natural Sciences Edition
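Funding
State Ethnic Affairs Commission Research Project (12YNZ008)
Scientific Research Fund of the Yunnan Provincial Department of Education (2012Y315)
Yunnan Minzu University Youth Fund (11QN08)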
Keywords
latent semantic analysis
document retrieval
singular value decomposition
k-means