Efficient SVM Chinese Web page classifier based on pre-classification

Cited by: 19
Abstract: Chinese Web page classification is an active research area in data mining, and the support vector machine (SVM) is an efficient method for classification. This paper first presents a system model for automatic Chinese Web page classification based on SVM and describes in detail the key techniques involved in the classification process, including Web page preprocessing, feature selection, and feature weight computation. It then proposes a pre-classification method that uses a preset keyword table, and explains the principle and implementation of that method. Experimental results show that, compared with using the SVM classifier alone, the method greatly reduces classification time and also clearly improves precision and recall.
Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 1, pp. 125-128 (4 pages)
Keywords: support vector machine; Chinese Web page classification; text classification; machine learning
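
The abstract describes pre-classification only at a high level: pages that a preset keyword table can label are handled directly, and the rest go to the SVM classifier. A minimal sketch of that control flow follows. The keyword table, the `min_hits` and `min_margin` thresholds, and the `fallback` callable (standing in for a trained SVM) are all illustrative assumptions; the abstract does not give the paper's actual decision rule.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Set

# Hypothetical keyword table (category -> preset keywords); not taken from the paper.
KEYWORD_TABLE: Dict[str, Set[str]] = {
    "sports": {"football", "league", "goal"},
    "finance": {"stock", "bank", "fund"},
}


def pre_classify(
    tokens: List[str],
    table: Dict[str, Set[str]],
    min_hits: int = 2,    # assumed threshold: minimum keyword hits to accept
    min_margin: int = 2,  # assumed threshold: lead over the runner-up category
) -> Optional[str]:
    """Return a category when keyword evidence is unambiguous, else None."""
    counts = Counter(tokens)
    # Score each category by the number of keyword occurrences in the page.
    scores = {cat: sum(counts[w] for w in words) for cat, words in table.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_cat, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if best >= min_hits and best - runner_up >= min_margin:
        return best_cat
    return None


def classify(
    tokens: List[str],
    table: Dict[str, Set[str]],
    fallback: Callable[[List[str]], str],
) -> str:
    """Use the keyword table first; call the slower classifier only if needed."""
    label = pre_classify(tokens, table)
    return label if label is not None else fallback(tokens)
```

In this sketch, the time saving comes from skipping the `fallback` call for pages the keyword table settles. Any callable that maps a token list to a label can be passed as `fallback`, for example a wrapper around a trained SVM model.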