Abstract
As a simple, effective, nonparametric classification method, the k-Nearest Neighbor (k-NN) algorithm is widely used in text classification, but it is computationally expensive. This paper proposes a new fast text classification approach to address this weakness. The method trains on the original sample set to generate representative samples, and then repeatedly adjusts those representatives according to the distribution of the original training samples relative to the generated ones, so that the representatives become more representative of the data. This approach effectively compresses the original training corpus, substantially improving classification efficiency; at the same time, because the representative samples are distributed more reasonably, classification accuracy can also be improved. Experimental results show that the approach achieves good classification performance.
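The abstract only summarizes the procedure (generate representatives, then iteratively adjust them toward the distribution of the original samples, then classify with k-NN over the compressed set). As a rough sketch under assumed details not given in the abstract (Euclidean distance, a class-wise k-means-style refinement, and the hypothetical function names `build_representatives` / `knn_predict`), the idea might look like:

```python
import numpy as np

def build_representatives(X, y, per_class=3, iters=10, rng=None):
    """Condense a training set into a few representatives per class, then
    refine each representative toward the centroid of the original samples
    it attracts (assumed k-means-style adjustment; details are illustrative)."""
    rng = np.random.default_rng(rng)
    reps, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        # initial representatives: random samples drawn from this class
        R = Xc[rng.choice(len(Xc), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # assign each original sample to its nearest representative
            d = np.linalg.norm(Xc[:, None, :] - R[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # move each representative to the mean of its assigned samples
            for j in range(k):
                members = Xc[nearest == j]
                if len(members):
                    R[j] = members.mean(axis=0)
        reps.append(R)
        labels.extend([c] * k)
    return np.vstack(reps), np.array(labels)

def knn_predict(x, reps, rep_labels, k=1):
    """k-NN vote over the compressed representative set instead of the
    full corpus, which is where the speedup comes from."""
    d = np.linalg.norm(reps - x, axis=1)
    idx = np.argsort(d)[:k]
    vals, counts = np.unique(rep_labels[idx], return_counts=True)
    return vals[counts.argmax()]
```

With, say, 3 representatives per class, a query is compared against only `3 * num_classes` vectors rather than every training document, which is the compression effect the abstract describes.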
Source
《计算机仿真》 (Computer Simulation), CSCD, 2007, No. 6, pp. 322-325 (4 pages)
Funding
National Natural Science Foundation of China (60204009)
Open Fund of the Key Laboratory of Complex Systems and Intelligence Science, Chinese Academy of Sciences (20040104)
National 973 Program (2004CB318109)
Keywords
Text classification
Representative samples
Fast classification