
Technology for Chinese text categorization based on reverse matching algorithm (cited by: 3)
Abstract: To address the automatic classification of Chinese text, a reverse matching algorithm is proposed. The basic idea is to construct a weighted subject-term list for the classification categories, then match the keywords in that list against the document to be classified (the reverse of matching the document's words against a dictionary) and sum the weights of the keywords that match. The category with the largest weight sum is taken as the classification result. The algorithm sidesteps the difficulty of Chinese word segmentation and its effect on classification results. Theoretical analysis and experimental results indicate that the technique achieves relatively high accuracy and time efficiency, and that its overall performance reaches the level of current mainstream techniques.
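Authors: 刘新 (Liu Xin), 刘任任 (Liu Renren)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list; 2008, No. 4, pp. 945-947 (3 pages)
Funding: National Natural Science Foundation of China (60673193); General Project of the Education Department of Hunan Province (07C750); Block-Grant Project of the Education Department of Hunan Province (06C870)
Keywords: text categorization; reverse matching algorithm; gain weight; subject terms list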
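
The matching scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the category names, keywords, and weights in `TERM_LIST` are hypothetical, and the sketch assumes that each keyword contributes its weight once if it occurs anywhere in the document (the abstract does not say whether repeated occurrences are counted). It also does not model how the paper derives its gain weights.

```python
from typing import Dict, Optional, Tuple

# Hypothetical weighted subject-term list: category -> {keyword: weight}.
# Categories, keywords, and weights are illustrative only, not from the paper.
TERM_LIST: Dict[str, Dict[str, float]] = {
    "sports": {"足球": 3.0, "比赛": 1.0, "球员": 2.0},
    "finance": {"股票": 3.0, "银行": 2.0, "市场": 1.0},
}


def classify(
    document: str,
    term_list: Dict[str, Dict[str, float]] = TERM_LIST,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Score each category by summing the weights of its keywords found in the text.

    Keywords are looked up in the raw document by substring search, so the
    document itself is never segmented into words.
    Assumption: a keyword counts once, however often it occurs.
    """
    scores: Dict[str, float] = {}
    for category, terms in term_list.items():
        scores[category] = sum(
            (weight for keyword, weight in terms.items() if keyword in document),
            0.0,
        )

    # Return no label when nothing matched at all.
    if not scores or max(scores.values()) == 0.0:
        return None, scores

    best = max(scores, key=lambda category: scores[category])
    return best, scores


if __name__ == "__main__":
    label, scores = classify("昨天的足球比赛中,球员表现出色")
    print(label, scores)
```

Because the lookup runs from the term list into the document, the cost grows with the number of keywords rather than with a segmentation pass over the text, which is consistent with the abstract's claim that word segmentation is avoided.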

