
Removal of Stop Words in Users' Requests for Information Retrieval (cited by: 16)
Abstract: For query requests posed in natural language, this paper distinguishes the expression of the information need from the information content. Based on a query corpus of nearly 200,000 sentences, contrasted with the People's Daily as a background corpus, it proposes general (absolute) Chinese stop words and query-specific relative stop words. The corresponding candidate word lists are built offline using left/right entropy, N-grams, and KL divergence. Using the bigram properties of the candidate words and their distribution across positions in the sentence, the paper gives a method for identifying stop words dynamically online. Experimental results show that this method labels stop words better than a baseline that consults only a static stop word list.
Authors: 熊文新 (Xiong Wenxin), 宋柔 (Song Rou)
Source: 《计算机工程》 (Computer Engineering), 2007, No. 6, pp. 195-197 (3 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
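Funding: National Natural Science Foundation of China (60272055); National "863" Program (2001AA114111); Key Project of Science and Technology Research of the Ministry of Education (00128); Major Project of the Ministry of Education Key Research Base for Humanities and Social Sciences (02JAZJD740007)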
Keywords: user query; stop word; construction; identification
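
The abstract names left/right entropy and KL divergence as the statistics used to build the candidate lists offline, but it does not give formulas or thresholds. The following is only a minimal sketch of how such scores are commonly computed, not the authors' implementation. The function names, the `<s>`/`</s>` boundary markers, the add-alpha smoothing, and the toy English token lists are all assumptions made for illustration; the paper works on segmented Chinese queries against the People's Daily.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(sentences, side="left"):
    """Entropy (in bits) of the tokens adjacent to each word on one side.

    Hypothetical helper: sentence boundaries are padded with <s> and </s>
    so that a word at the edge of a sentence still has a neighbor.
    """
    neighbors = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            adjacent = padded[i - 1] if side == "left" else padded[i + 1]
            neighbors[padded[i]][adjacent] += 1
    entropy = {}
    for word, counts in neighbors.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def kl_contributions(query_sentences, background_sentences, alpha=1.0):
    """Per-word term p_q(w) * log2(p_q(w) / p_b(w)) of KL(query || background).

    Add-alpha smoothing (an assumption) keeps both probabilities non-zero.
    A large positive value marks a word that is far more typical of the
    query corpus than of the background corpus.
    """
    q = Counter(t for s in query_sentences for t in s)
    b = Counter(t for s in background_sentences for t in s)
    vocab = set(q) | set(b)
    q_total = sum(q.values()) + alpha * len(vocab)
    b_total = sum(b.values()) + alpha * len(vocab)
    scores = {}
    for w in vocab:
        pq = (q[w] + alpha) / q_total
        pb = (b[w] + alpha) / b_total
        scores[w] = pq * math.log2(pq / pb)
    return scores


# Toy, pre-tokenized data (illustrative only).
queries = [
    ["please", "find", "papers", "on", "stop", "words"],
    ["please", "find", "news", "about", "entropy"],
    ["please", "find", "reports", "on", "retrieval"],
]
background = [
    ["the", "news", "reports", "on", "entropy"],
    ["papers", "on", "retrieval", "and", "stop", "words"],
]

left = neighbor_entropy(queries, side="left")
right = neighbor_entropy(queries, side="right")
kl = kl_contributions(queries, background)

# Rank words by how strongly they characterize the query corpus.
ranked = sorted(kl, key=kl.get, reverse=True)
```

In this toy data, request words such as "please" and "find" occur only in the queries, so they receive the largest positive KL terms, which matches the idea of a query-specific relative stop word. How the paper combines the entropy and KL scores into a candidate list is not stated in the abstract.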

References (7)

  • 1 Yang Y, Pedersen J. A Comparative Study on Feature Selection in Text Categorization[C]//Proceedings of the 14th International Conference on Machine Learning. 1997: 412-420.
  • 2 Fox C. Lexical Analysis and Stoplists[M]//Information Retrieval: Data Structures and Algorithms. Upper Saddle River, New Jersey: Prentice Hall, 1992.
  • 3 Sinka M, Corne D. Towards Modernised and Web-Specific Stoplists for Web Document Analysis[C]//Proceedings of the IEEE/WIC International Conference on Web Intelligence, Halifax, Canada. 2003.
  • 4 顾益军 (Gu Yijun), 樊孝忠 (Fan Xiaozhong), 王建华 (Wang Jianhua), 汪涛 (Wang Tao), 黄维金 (Huang Weijin). 中文停用词表的自动选取 [Automatic selection of a Chinese stop word list][J]. 北京理工大学学报, 2005, 25(4): 337-340. Cited by: 35
  • 5 Lo R, He B, Ounis I. Automatically Building a Stopword List for an Information Retrieval System[C]//Proceedings of the 5th Dutch-Belgian Information Retrieval Workshop, Utrecht, the Netherlands. 2005.
  • 6 熊文新 (Xiong Wenxin), 宋柔 (Song Rou). 信息检索查询语句的表述分析 [Analysis of how information retrieval queries are expressed][C]//第4届全国语言文字应用学术研讨会 [4th National Symposium on Applied Language and Script], Chengdu. 2005.
  • 7 Manning C, Schütze H. Foundations of Statistical Natural Language Processing[M]. Cambridge, MA: MIT Press, 1999.

