Abstract
Similarity classifiers based on the vector space model (VSM) are simple and widely used for text categorization, but their similarity threshold is usually set empirically, which limits classification accuracy. This paper proposes a boosting-based mechanism that automatically computes the similarity threshold for different document sets. Boosting iterations generate a number of sub-classifiers, each defined by a similarity partition, and the similarity thresholds that determine these sub-classifiers are combined by weighting to approximate the ideal similarity threshold. Experiments show that the resulting similarity classifier is on average about 15% more accurate than traditional similarity classifiers and is comparable to some more complex classification methods. For real-time processing of text from the Internet (classification, filtering, and retrieval), it is at least 3 times as efficient as those complex methods, and its advantage grows as problems become larger and more complex.
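The abstract does not give the update rule or the weighting formula, so the following is only a minimal sketch of the general idea, assuming an AdaBoost-style loop: each round fits a one-threshold sub-classifier on similarity scores, and the final threshold is the average of the round thresholds weighted by each round's vote. The function names (`best_threshold`, `boosted_threshold`), the `rounds` parameter, and the use of AdaBoost's `alpha` as the combination weight are all assumptions made for illustration, not the paper's stated method.

```python
import math


def best_threshold(sims, labels, weights):
    """Pick the threshold t minimizing the weighted error of 'sim >= t => +1'."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(sims)):
        err = sum(
            w for s, y, w in zip(sims, labels, weights)
            if (1 if s >= t else -1) != y
        )
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


def boosted_threshold(sims, labels, rounds=10):
    """Combine per-round thresholds into one (assumed AdaBoost-style weighting).

    sims   -- similarity of each document to the class profile (e.g. cosine)
    labels -- +1 if the document belongs to the class, -1 otherwise
    """
    n = len(sims)
    weights = [1.0 / n] * n
    thresholds, alphas = [], []
    for _ in range(rounds):
        t, err = best_threshold(sims, labels, weights)
        if err >= 0.5:  # sub-classifier is no better than chance: stop
            break
        err = max(err, 1e-10)  # avoid log of infinity on separable data
        alpha = 0.5 * math.log((1 - err) / err)
        thresholds.append(t)
        alphas.append(alpha)
        # Increase the weight of misclassified documents, then renormalize.
        weights = [
            w * math.exp(-alpha * y * (1 if s >= t else -1))
            for s, y, w in zip(sims, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    if not alphas:
        raise ValueError("no sub-classifier performed better than chance")
    return sum(a * t for a, t in zip(alphas, thresholds)) / sum(alphas)


if __name__ == "__main__":
    sims = [0.10, 0.40, 0.35, 0.60, 0.50, 0.90]
    labels = [-1, -1, 1, 1, -1, 1]
    print(boosted_threshold(sims, labels))
```

Under these assumptions the combined value is a convex combination of the per-round thresholds, so it always lies within the range of observed similarity scores; a new document would then be assigned to the class when its similarity is at or above that value.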
Source
Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》)
EI
CAS
CSCD
PKU Core (北大核心)
2003, No. 1, pp. 108-111 (4 pages)
Funding
National Natural Science Foundation of China (79990580)
National Key Basic Research Development Program of China (G1998030414)