
Application Research of Document Information Filtering Based on Multi-Strategy (基于多策略的文档信息过滤技术的应用研究)
Abstract: To improve the precision of text information retrieval and shorten retrieval time, a document filtering algorithm based on multiple strategies is proposed. The algorithm first generates candidate words from potential part-of-speech features, then expands the candidate set with feature words discovered from the title. It next uses an improved TFIDF method to weight and combine the features of the candidate words and removes the words that do not meet the requirements. It then computes the similarity between the user-requirement vector and the vector of each document to be filtered, and provides the user with the documents whose similarity exceeds a given threshold. The feasibility of the algorithm is demonstrated from two aspects: the determination of the experimental parameters and the effect of each strategy on the results. The experimental results show that the multi-strategy document information filtering algorithm improves retrieval precision and the quality of information retrieval, and significantly outperforms the traditional information filtering method.
Authors: 杨陟卓 (Yang Zhizhuo), 韩燮 (Han Xie)
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, PKU Core Journal, 2009, No. 5, pp. 1262-1266 (5 pages)
Keywords: information retrieval; information filtering; text feature extraction; TFIDF+; vector space model (VSM)
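
The final steps described in the abstract (weighting terms, comparing the user-requirement vector with each document vector, and keeping the documents above a threshold) can be sketched as follows. This is only a minimal illustration under stated assumptions: it uses a standard smoothed TF-IDF weighting and cosine similarity, not the paper's improved TFIDF, and it omits the part-of-speech and title-based candidate strategies. The function names, the tokenized input format, and the default threshold are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists.

    Assumption: plain smoothed TF-IDF, standing in for the paper's
    improved TFIDF, whose details the abstract does not give.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per doc
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_documents(profile_tokens, docs, threshold=0.3):
    """Return (index, score) for docs whose similarity to the profile exceeds threshold.

    `threshold` is a hypothetical default; the paper determines its
    parameters experimentally.
    """
    vectors, idf = tfidf_vectors(docs)
    # Weight the user-requirement terms with the same IDF table,
    # ignoring terms that never occur in the document collection.
    tf = Counter(t for t in profile_tokens if t in idf)
    total = sum(tf.values()) or 1
    profile = {t: (c / total) * idf[t] for t, c in tf.items()}
    kept = []
    for i, vec in enumerate(vectors):
        score = cosine(profile, vec)
        if score > threshold:
            kept.append((i, score))
    return kept
```

For example, calling `filter_documents(["information", "filtering"], docs, threshold=0.5)` on a small tokenized collection keeps only the documents that share enough weighted terms with the user requirement. Raising the threshold trades recall for precision.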


