Abstract
To improve the precision of text information retrieval and shorten retrieval time, a multi-strategy document filtering algorithm is proposed. The algorithm first generates candidate words from potential part-of-speech features and expands the candidate set with feature words discovered from document titles. It then uses an improved TFIDF method to weight and combine the features of the candidate words and removes the words that do not meet the requirements. Finally, it computes the similarity between the user-need vector and the vector of each document to be filtered, and provides the user with the documents whose similarity exceeds a given threshold. The feasibility of the algorithm is demonstrated from two angles: the determination of the experimental parameters and the effect of each strategy on the results. The experimental results show that the multi-strategy document filtering algorithm improves retrieval precision and the quality of information retrieval, and significantly outperforms the traditional information filtering method.
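The last two steps described above (term weighting, then similarity against a threshold) can be illustrated with a minimal sketch. It makes several assumptions: it uses plain smoothed TF-IDF rather than the paper's improved TFIDF, cosine similarity as the similarity measure (the abstract does not name one), and pre-tokenized input. It does not attempt the part-of-speech candidate generation or the title-based expansion. The names `tfidf_vectors`, `cosine`, and `filter_documents` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: standard smoothed TF-IDF, not the paper's improved variant.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_documents(profile, docs, threshold):
    """Return indices of documents whose similarity to the user profile
    is strictly greater than the threshold."""
    # Weight the profile together with the documents so they share IDF values.
    vectors = tfidf_vectors(docs + [profile])
    profile_vec = vectors[-1]
    return [i for i, vec in enumerate(vectors[:-1])
            if cosine(profile_vec, vec) > threshold]


docs = [
    ["document", "filtering", "algorithm"],
    ["cooking", "recipe", "pasta"],
    ["information", "filtering", "document"],
]
profile = ["document", "filtering"]
passed = filter_documents(profile, docs, threshold=0.3)
```

In this toy input, the second document shares no terms with the profile, so its similarity is zero and it is filtered out. The threshold value of 0.3 is arbitrary; the abstract indicates that the actual parameters were determined experimentally.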
Source
《计算机工程与设计》
CSCD
PKU Core (Peking University core journal list)
2009, Issue 5, pp. 1262-1266 (5 pages)
Computer Engineering and Design