
关于“中文网页自动分类竞赛”结果的分析 (Cited by: 6)

The Analysis of a Contest Result on Chinese Web Page Automatic Categorization
Abstract: A Chinese Web page automatic categorization contest was held at the recent National Symposium on Search Engine and Web Mining, with ten teams from across the country taking part. After describing the rules and process of the contest, this paper analyzes the results in detail, giving a concrete picture of the current state of Chinese Web page automatic categorization technology: existing classifiers show no marked difference in performance, and categorizing Chinese Web pages is much harder than categorizing plain text. The paper also attempts to introduce a standard sample set of Chinese Web page categorization instances, in the hope that, through continued refinement, it will eventually serve as a basic corpus for research on Chinese Web page categorization.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, PKU Core; 2003, No. 5, pp. 34-40 (7 pages)
Funding: Supported by the National 973 Major Basic Research Program of China (G1999032706)
Keywords: computer application; Chinese information processing; machine learning; Chinese Web page automatic categorization; TREC evaluation