Application Research of Document Information Filtering Based on Multiple Strategies

Abstract: To improve the precision of text information retrieval and shorten retrieval time, a document filtering algorithm based on multiple strategies is proposed. The algorithm first generates candidate words from potential part-of-speech features, then expands the candidate set with feature words discovered in the title. Next, it uses an improved TFIDF method to weight and combine the features of the candidate words and removes the words that do not meet the requirements. Finally, it computes the similarity between the user-requirement vector and the vector of each document to be filtered, and provides the user with the documents whose similarity exceeds a given threshold. The feasibility of the algorithm is demonstrated from two angles: the determination of the experimental parameters and the effect of each strategy on the results. The experimental results show that the multi-strategy document information filtering algorithm improves retrieval precision and the quality of information retrieval, and significantly outperforms the traditional information filtering method.
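The final steps described in the abstract (weight the terms, compare the user-requirement vector with each document vector, keep documents above a threshold) can be illustrated with a minimal sketch. It rests on several assumptions. The abstract does not specify the improved TFIDF, so a standard smoothed TF-IDF is swapped in. Documents are assumed to be already segmented into tokens. The part-of-speech candidate generation and the title-based expansion are omitted. The function names and the threshold value are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus.

    Assumption: standard smoothed IDF, not the paper's improved TFIDF.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector {term: weight}; terms unknown to idf are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_documents(profile_tokens, docs, threshold=0.5):
    """Return (index, similarity) for documents scoring above the threshold.

    The threshold default is a placeholder, not a value from the paper.
    """
    idf = idf_table(docs)
    profile = tfidf(profile_tokens, idf)
    scored = [(i, cosine(profile, tfidf(doc, idf))) for i, doc in enumerate(docs)]
    return [(i, s) for i, s in scored if s > threshold]


if __name__ == "__main__":
    corpus = [
        ["information", "retrieval", "filtering"],
        ["cloud", "virtual", "machine"],
        ["text", "filtering", "feature"],
    ]
    for index, score in filter_documents(["information", "filtering"], corpus):
        print(index, round(score, 3))
```

Sparse dictionaries are used instead of dense arrays so the sketch stays dependency-free; in practice the threshold would be tuned experimentally, which is what the abstract refers to as determining the experimental parameters.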

Authors: YANG Zhizhuo (杨陟卓), HAN Xie (韩燮)

Affiliation: School of Electronics and Computer Science and Technology, North University of China

Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the PKU Core Journals list, 2009, No. 5, pp. 1262-1266 (5 pages)

Keywords: information retrieval; information filtering; text feature extraction; TFIDF+; vector space model (VSM)

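Classification numbers: TP391.1 [Automation and Computer Technology: Computer Application Technology]; TP301.4 [Automation and Computer Technology: Computer System Architecture]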
