
A Method for Computing Text Similarity Thresholds (cited by: 18)

Computing similarity threshold for text classification
Abstract: The VSM (vector space model) based similarity classifier is a simple and popular text categorization method. However, its similarity threshold is usually set empirically, which limits classification accuracy. This paper proposes a boosting-based mechanism that automatically computes the similarity threshold for different document sets. The method uses boosting iterations to generate multiple similarity-partition-based sub-classifiers, then combines the similarity thresholds that define these sub-classifiers by weighting, yielding an approximation to the ideal similarity threshold. Experiments showed that the resulting similarity classifier was on average about 15% more accurate than traditional similarity classifiers and was comparable to some complex classification methods. For real-time processing of text from the Internet (classification, filtering, and retrieval), it was more than 3 times as efficient as those complex methods, and its advantage grows as problems become larger and more complex.
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2003, No. 1, pp. 108-111 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (79990580); National Key Basic Research Development Program of China (G1998030414)
Keywords: similarity threshold; data mining; text mining; text categorization; boosting mechanism; vector space model; computation method
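
The abstract describes the idea only at a high level: boosting produces several threshold-based sub-classifiers, and their thresholds are combined by weighting. The following is a minimal sketch of that idea, not the authors' algorithm. It assumes discrete AdaBoost (Freund and Schapire) with one-sided threshold stumps on a precomputed similarity score, and it assumes the combination is an alpha-weighted average of the stump thresholds. The record does not state the weak learner, the weight update, or the combination formula, and the function names `cosine`, `best_stump`, and `boosted_threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(sim, threshold):
    """Sub-classifier rule: in-class (+1) if similarity reaches the threshold."""
    return 1 if sim >= threshold else -1


def best_stump(sims, labels, weights):
    """Return (threshold, weighted_error) of the best one-sided stump."""
    values = sorted(set(sims))
    # Candidate thresholds: midpoints between adjacent distinct similarities.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best_t, best_err = None, float("inf")
    for t in candidates:
        err = sum(w for s, y, w in zip(sims, labels, weights)
                  if predict(s, t) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


def boosted_threshold(sims, labels, rounds=10):
    """Assumed combination: alpha-weighted average of stump thresholds.

    sims   -- similarity of each training document to the class profile
    labels -- +1 if the document belongs to the class, otherwise -1
    """
    n = len(sims)
    weights = [1.0 / n] * n
    thresholds, alphas = [], []
    for _ in range(rounds):
        t, err = best_stump(sims, labels, weights)
        if err >= 0.5:  # no stump beats chance on the current weighting
            break
        err = max(err, 1e-10)  # avoid log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        thresholds.append(t)
        alphas.append(alpha)
        # AdaBoost reweighting: misclassified documents gain weight.
        weights = [w * math.exp(-alpha * y * predict(s, t))
                   for s, y, w in zip(sims, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    if not alphas:
        raise ValueError("no sub-classifier performed better than chance")
    return sum(a * t for a, t in zip(alphas, thresholds)) / sum(alphas)
```

In use, `sims` would hold the cosine similarity of each training document vector to a class profile vector (for example `cosine(doc_vec, profile_vec)`), and the returned value would replace the empirically chosen threshold of the similarity classifier.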
Secondary references (5)

  • [1] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997, 55(1): 119-139.
  • [2] Breiman, L., Friedman, J., Olshen, R., et al. Classification and Regression Trees. Belmont, CA: Wadsworth, 1984. 1-357.
  • [3] Schapire, R., Singer, Y. BoosTexter: a boosting-based system for text categorization. Machine Learning, 2000, 39(2/3): 135-168.
  • [4] Salton, G., Wong, A., Yang, C. A vector space model for automatic indexing. Communications of the ACM, 1975, 18: 613-620.
  • [5] Schapire, R., Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 1999, 37(3): 297-336.
