A Fast Chinese Text Categorization Technique Based on a Subject Term List (一种基于主题词表的快速中文文本分类技术)


Abstract: A new algorithm is proposed for the automatic classification of Chinese text. The basic idea is to construct a weighted classification subject term list, organized as a key tree. Hashing and the longest-match-first principle are then used to match strings from the document to be classified against the list, the weights of the successfully matched terms are summed, and the category with the largest weight sum is taken as the classification result. The algorithm sidesteps the difficulty of Chinese word segmentation and its influence on the accuracy of the result. Theoretical analysis and experimental results indicate that the technique achieves relatively high accuracy and time efficiency, and that its overall performance reaches the level of current mainstream techniques.

Authors: Liu Xin (刘新), Liu Renren (刘任任)

Affiliation: College of Information Engineering, Xiangtan University

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2008, No. 3, pp. 323-327 (5 pages)

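Funding: National Natural Science Foundation of China (60673193); Key Project of the Hunan Provincial Department of Education (07A067); General Project of the Hunan Provincial Department of Education (07C750); Xiangtan University Interdisciplinary Spark Project (0609016).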

Keywords: text categorization, subject term list, key tree, hash function, gain weight

Classification codes: G254.0 [Culture and Science: Library Science]; TP391 [Automation and Computer Technology: Computer Application Technology]
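
The matching procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the hash function, the weighting scheme ("gain weight"), or the term list, so the sketch assumes a key tree whose child links are stored in Python dicts (standing in for the hashing step), and the sample terms, category names, and weights in `TERMS` are hypothetical. The names `TrieNode`, `build_trie`, `longest_match`, and `classify` are likewise illustrative.

```python
from collections import defaultdict


class TrieNode:
    """One node of the key tree; children are held in a dict (hash lookup)."""

    def __init__(self):
        self.children = {}
        self.entries = []  # (category, weight) pairs for a term ending here


def build_trie(term_list):
    """Build the key tree from (term, category, weight) triples."""
    root = TrieNode()
    for term, category, weight in term_list:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append((category, weight))
    return root


def longest_match(root, text, start):
    """Return (length, entries) of the longest term beginning at `start`."""
    node, best_len, best_entries = root, 0, []
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.entries:  # a complete term ends here; keep the longest so far
            best_len, best_entries = i - start + 1, node.entries
    return best_len, best_entries


def classify(root, text):
    """Sum matched weights per category; return (best category, scores)."""
    scores = defaultdict(float)
    pos = 0
    while pos < len(text):
        length, entries = longest_match(root, text, pos)
        if length == 0:
            pos += 1  # no term starts here; move on one character
            continue
        for category, weight in entries:
            scores[category] += weight
        pos += length  # skip past the matched term (long terms take priority)
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)


# Hypothetical weighted subject term list, for illustration only.
TERMS = [
    ("足球", "sports", 2.0),
    ("足球比赛", "sports", 3.0),
    ("比赛", "sports", 1.0),
    ("股票", "finance", 2.5),
]

if __name__ == "__main__":
    root = build_trie(TERMS)
    print(classify(root, "昨天的足球比赛很精彩，股票下跌"))
```

Because the scan works directly on the character stream and always takes the longest term found at each position, no separate word-segmentation pass is needed, which is the property the abstract emphasizes. In the sample sentence, 足球比赛 is matched as a single term rather than as 足球 followed by 比赛.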



