A Fast Chinese Text Categorization Technique Based on a Subject Term List (一种基于主题词表的快速中文文本分类技术)
Abstract A new algorithm is proposed for the automatic categorization of Chinese text. The basic idea is to construct a weighted subject term list for the categories, organized as a key tree. Strings in the document to be classified are then matched against the list using hashing and a longest-match-first rule, the weights of the successfully matched terms are summed, and the category with the largest weight sum is taken as the classification result. The algorithm avoids the difficulties of Chinese word segmentation and their effect on the classification result. Theoretical analysis and experimental results indicate that both the accuracy and the time efficiency of the technique are relatively high, and that its overall performance reaches the level of current mainstream techniques.
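The matching procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `TrieNode`, `build_trie`, `longest_match`, and `classify` are hypothetical, the sample terms and weights in `TERMS` are invented for the example, and the paper's method for computing weights (the "gain weight" in the keywords) is not shown because the abstract does not describe it. Hashing is approximated here by Python dictionaries, which are hash-based.

```python
from collections import defaultdict


class TrieNode:
    """One node of the key tree; children are looked up by hashing."""

    def __init__(self):
        self.children = {}   # character -> TrieNode (dict lookup is hash-based)
        self.weights = None  # {category: weight} if a subject term ends here


def build_trie(term_list):
    """Build a key tree from (term, category, weight) triples."""
    root = TrieNode()
    for term, category, weight in term_list:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.weights is None:
            node.weights = {}
        node.weights[category] = weight
    return root


def longest_match(root, text, start):
    """Return (length, weights) of the longest term starting at `start`."""
    node = root
    best_len, best_weights = 0, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.weights is not None:
            best_len, best_weights = i - start + 1, node.weights
    return best_len, best_weights


def classify(root, text):
    """Sum matched weights per category and return the top category."""
    scores = defaultdict(float)
    pos = 0
    while pos < len(text):
        length, weights = longest_match(root, text, pos)
        if length == 0:
            pos += 1          # no term starts here; move one character on
            continue
        for category, weight in weights.items():
            scores[category] += weight
        pos += length         # skip past the matched term
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)


# Hypothetical term list: (term, category, weight). Values are made up.
TERMS = [
    ("足球", "sports", 2.0),
    ("足球比赛", "sports", 3.0),
    ("比赛", "sports", 1.0),
    ("股票", "finance", 3.0),
    ("市场", "finance", 1.0),
]

if __name__ == "__main__":
    trie = build_trie(TERMS)
    print(classify(trie, "昨天的足球比赛很精彩"))
```

Because the scan works character by character against the term list, no separate word segmentation step is needed. In the sample sentence the longest-match rule counts "足球比赛" once rather than also counting the shorter terms "足球" and "比赛" contained in it.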
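Authors 刘新 (Liu Xin), 刘任任 (Liu Renren)
Source Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, PKU Core, 2008, No. 3, pp. 323-327 (5 pages)
Funding National Natural Science Foundation of China (60673193); Key Project of the Hunan Provincial Department of Education (07A067); General Project of the Hunan Provincial Department of Education (07C750); Xiangtan University Interdisciplinary Spark Project (0609016).
Keywords text categorization; subject term list; key tree; hash function; gain weight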