Abstract
A Chinese Web page automatic categorization contest was held at the recent National Symposium on Search Engine and Web Mining, with ten teams from across the country taking part. After describing the rules and process of the contest, this paper analyzes the results in detail, giving a concrete picture of the current state of Chinese Web page automatic categorization technology: existing classifiers show no marked differences in performance, and categorizing Chinese Web pages is much harder than categorizing plain text. The paper also attempts to introduce a standard sample set of Chinese Web page categorization instances, in the hope that, through continued refinement, it will eventually serve as a base corpus for research on Chinese Web page categorization.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2003, Issue 5, pp. 34-40 (7 pages)
Funding
Supported by the National 973 Major Basic Research Program of China (G1999032706)
Keywords
computer application
Chinese information processing
machine learning
Chinese Web page automatic categorization
TREC evaluation