Text Representation Based on Sentence and Chinese Text Categorization (基于句子的文本表示及中文文本分类研究)

Cited by: 3
Abstract: Text mining is a key technology in information resources management. The vector space model is a mature text representation model in text mining; it usually takes words or phrases as feature items, but these items carry little semantic information. To support content-based text mining, this paper raises the segmentation granularity from words or phrases to sentences: a text is represented as a bag of sentences, and text similarity is defined in terms of sentence similarity. To validate the model, a Chinese text classifier was built with the KNN algorithm. In the experiments, the bag-of-sentences KNN classifier achieved an average precision of 92.12% and a recall of 92.01%, which is a fairly satisfactory result.
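Authors: 何维 (He Wei), 王宇 (Wang Yu)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2009, No. 6, pp. 839-843 (5 pages)
Funding: Key Program of the National Natural Science Foundation of China (70431001)
Keywords: information resources management; bag of sentences; text representation; text categorization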
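
The pipeline outlined in the abstract (split a text into sentences, compare texts through sentence similarity, classify with KNN) can be sketched roughly as follows. The abstract does not say how sentence similarity or text similarity is actually computed, so everything specific here is an assumption made for illustration: Jaccard overlap of character bigrams stands in for sentence similarity, a symmetric best-match average stands in for text similarity, and the function names and toy training set are hypothetical. It is not the authors' implementation.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split a text into its bag (list) of sentences on sentence-ending punctuation."""
    parts = re.split(r"[。！？；!?;]+", text)
    return [p.strip() for p in parts if p.strip()]


def char_bigrams(sentence):
    """Character bigrams of a sentence; a one-character sentence maps to itself."""
    s = sentence.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def sentence_similarity(a, b):
    """ASSUMPTION: Jaccard overlap of character bigrams (not specified in the abstract)."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    return len(ga & gb) / len(ga | gb)


def text_similarity(bag_a, bag_b):
    """ASSUMPTION: average best-match sentence similarity, made symmetric."""
    if not bag_a or not bag_b:
        return 0.0

    def directed(src, dst):
        # For each source sentence, take its most similar sentence in dst.
        return sum(max(sentence_similarity(s, t) for t in dst) for s in src) / len(src)

    return 0.5 * (directed(bag_a, bag_b) + directed(bag_b, bag_a))


def knn_classify(text, training, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    bag = split_sentences(text)
    scored = sorted(
        ((text_similarity(bag, split_sentences(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("股市今天大幅上涨。投资者信心增强。", "finance"),
    ("央行宣布下调利率。股市随即上涨。", "finance"),
    ("球队昨晚赢得比赛。球迷非常兴奋。", "sports"),
    ("主教练称赞球员表现。球队下周继续比赛。", "sports"),
]

if __name__ == "__main__":
    print(knn_classify("股市上涨提振投资者信心。", TRAINING, k=3))
```

Character bigrams are used here only because they need no word segmenter; any other sentence similarity measure could be swapped into `sentence_similarity` without changing the rest of the sketch.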