Abstract
Web page classification typically involves many categories and few training samples per category, so conventional classifiers perform poorly in practice. To address this, a centroid-based statistical learning method is proposed. With only a small set of manually labeled web pages for training, the method achieves good classification performance and substantially shortens training time; prediction can be further accelerated by incorporating the hierarchical category (directory) information of web pages. Experiments on the dataset of the first LSHTC evaluation show that the centroid-based method trains and predicts quickly while remaining highly competitive in accuracy.
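The abstract does not spell out the algorithm, so the following is only a minimal sketch of a generic centroid-based (nearest-centroid) text classifier, not the paper's exact method. It assumes whitespace tokenization, L2-normalized term-count vectors, and cosine similarity; the function names `vectorize`, `train_centroids`, and `classify` and the toy data are hypothetical. The use of hierarchical category information mentioned in the abstract is not covered here.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn text into an L2-normalized bag-of-words vector (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


def train_centroids(docs):
    """Build one normalized centroid per label from (text, label) pairs.

    Training is a single pass that sums document vectors per class.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for text, label in docs:
        for term, weight in vectorize(text).items():
            sums[label][term] += weight
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm:
            centroids[label] = {t: w / norm for t, w in vec.items()}
    return centroids


def classify(text, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = vectorize(text)

    def cosine(centroid):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy data, invented purely for illustration.
train = [
    ("football match goal score", "sports"),
    ("team wins football league", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates market", "finance"),
]
centroids = train_centroids(train)
print(classify("football team score", centroids))
```

Because training only accumulates one vector per class and prediction compares a document against one centroid per class, a scheme of this kind is cheap in both phases, which is consistent with the speed claims in the abstract.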
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journals (北大核心)
2012, Issue 7, pp. 260-263, 281 (5 pages)
Keywords
Centroid-based
Text classification
Statistical learning