Chinese Web Page Classification Based on Dynamic Generation of Representative Samples (cited by: 2)

Original title: 基于代表样本动态生成的中文网页分类

Abstract: This paper proposes a new classification algorithm for Chinese Web pages based on the dynamic generation of representative samples. The algorithm trains on the original training set to generate representative samples one at a time. It also uses the information carried by each pruned training sample to adjust the representative samples already generated, repeatedly, so that they become more representative. Experiments with a Chinese Web page classifier built on this algorithm show that it effectively compresses the original training set and improves classification efficiency while maintaining classification accuracy, giving good overall classification performance.
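The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes documents are already represented as numeric feature vectors, uses cosine similarity, and treats a representative sample as a running mean of the training samples it has absorbed. The names `build_representatives` and `classify` and the `threshold` and `k` parameters are hypothetical. A training sample that is close enough to an existing representative of its own class is pruned, but its information is folded into that representative (the "adjustment"); otherwise it becomes a new representative. Classification is then k-nearest-neighbor voting over the much smaller set of representatives.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_representatives(samples, labels, threshold=0.8):
    """Single pass over the training set, generating representatives one by one.

    Assumption: a representative is the running mean of the samples it absorbed.
    """
    reps = []
    for x, y in zip(samples, labels):
        # Find the most similar existing representative of the same class.
        best, best_sim = None, -1.0
        for rep in reps:
            if rep["label"] == y:
                sim = cosine(rep["vector"], x)
                if sim > best_sim:
                    best, best_sim = rep, sim
        if best is not None and best_sim >= threshold:
            # Prune the sample, but use it to adjust the representative.
            n = best["count"]
            best["vector"] = [(c * n + xi) / (n + 1)
                              for c, xi in zip(best["vector"], x)]
            best["count"] = n + 1
        else:
            # No sufficiently close representative: the sample becomes one.
            reps.append({"vector": list(x), "label": y, "count": 1})
    return reps


def classify(reps, x, k=3):
    """k-NN over representatives, with votes weighted by cosine similarity."""
    scored = [(cosine(r["vector"], x), r["label"]) for r in reps]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]
```

In this sketch, a higher `threshold` keeps more representatives (less compression, closer to plain k-NN), while a lower one compresses the training set more aggressively. How the paper actually balances the two is not stated in the abstract.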

Authors: HUA Bei (华北), CAO Xianbin (曹先彬)

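Affiliation: Department of Computer Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Computing and Communication Software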

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2006, Issue 10, pp. 2502-2504 (3 pages)

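Funding: National Natural Science Foundation of China (60204009); National 973 Program (2004CB318109); Open Fund of the Key Laboratory of Complex Systems and Intelligence Science, Chinese Academy of Sciences (20040104)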

Keywords: k-nearest neighbor; representative samples; adjustment

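Classification codes: TP391 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]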

Citation network

References (8)

1. YANG Y, LIU X. A re-examination of text categorization methods[A]. The 22nd Annual Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99)[C]. New York: ACM Press, 1999. 42-49.
2. LEWIS D D. Naive (Bayes) at forty: The independence assumption in information retrieval[A]. The 10th European Conf on Machine Learning (ECML-98)[C]. New York: Springer-Verlag, 1998. 4-15.
3. SEBASTIANI F. Machine learning in automated text categorization[J]. ACM Computing Surveys, 2002, 34(1): 1-47.
4. JOACHIMS T. Text categorization with support vector machines: Learning with many relevant features[A]. The 10th European Conf on Machine Learning (ECML-98)[C]. Berlin: Springer, 1998. 137-142.
5. LI Xiaoli, LIU Jimin, SHI Zhongzhi. A Chinese Web page classifier combining support vector machines with unsupervised clustering[J]. Chinese Journal of Computers (计算机学报), 2001, 24(1): 62-68. Cited by: 108.
6. HE Haijun, WANG Jianfen, ZHOU Qing, CAO Yuanda. A Chinese Web page classifier based on decision support vector machines[J]. Computer Engineering (计算机工程), 2003, 29(2): 47-48. Cited by: 19.
7. LI Ronglu, HU Yunfa. A density-based training-sample pruning method for kNN text classifiers[J]. Journal of Computer Research and Development (计算机研究与发展), 2004, 41(4): 539-545. Cited by: 98.
8. ZHOU S G, LING T W, GUAN J H, et al. Fast text classification: a training-corpus pruning based approach[A]. Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003)[C]. IEEE CS, March 26-28, Kyoto, Japan, 2003. 127-136.


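Citing documents (2)

1. LI Cunhe, FENG Jing. An improved KNN Web page classification algorithm[J]. Microcomputer Applications (微计算机应用), 2008, 29(3): 21-25. Cited by: 3.
2. LI Min, DU Haishun, WANG Qi. Research on a Chinese Web page classification method based on the KNC algorithm[J]. Journal of Henan University (Natural Science Edition) (河南大学学报（自然科学版）), 2010, 40(5): 529-532.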

