
Research on Effects of Term Weighting Factors for Text Categorization (cited by: 16)
Abstract: In traditional vector-space-based text categorization, term weighting and feature selection are carried out in complete isolation. The score a feature selection function assigns to a term reflects that term's importance, yet it is not carried into the weight representation, which makes the feature representation imprecise and hurts classification performance. Some improved methods modify the TFIDF model with feature selection functions and similar scores and achieve better classification performance, but they do not examine how each weighting factor affects that performance. This paper takes term frequency (TF), inverse document frequency (IDF), and feature selection functions as the factors measuring a feature's ability to represent a document, to distinguish between documents, and to distinguish between categories, respectively, and tests their effect on classification performance experimentally. The results show that the document-representation factor (TF) yields the highest peak performance but is sensitive to noisy features; the document-discrimination factor (IDF) is robust to noise but unstable in performance; and the category-discrimination factor (the feature selection function) is the most noise-tolerant and the most stable. Finally, four principles for constructing weighting schemes are proposed, and experiments verify that they improve classification performance.
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, PKU Core, 2010, No. 3, pp. 97-104 (8 pages)
Funding: National Natural Science Foundation of China (60873166); National 973 Program (2007CB311103); National 863 Program (2006AA010105)
Keywords: computer application; Chinese information processing; text categorization; term weighting; effects of weighting factors; VSM
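
The three factors named in the abstract can be illustrated with a small sketch. It assumes tokenized documents, a smoothed IDF, and the chi-square statistic as the feature selection function; the abstract does not say which selection functions or which combination rules the paper uses, so the product in `combined` and the function names `chi_square` and `term_weights` are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def chi_square(term, docs, labels, category):
    """Chi-square score of `term` for `category` from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def term_weights(doc, docs, labels):
    """Return the three factors and one possible combination for each term in `doc`."""
    n_docs = len(docs)
    categories = set(labels)
    weights = {}
    for term, count in Counter(doc).items():
        # Document representation: relative term frequency.
        tf = count / len(doc)
        # Document discrimination: smoothed IDF (an assumed variant).
        df = sum(1 for tokens in docs if term in tokens)
        idf = math.log((n_docs + 1) / (df + 1)) + 1
        # Category discrimination: best chi-square score over all categories.
        chi = max(chi_square(term, docs, labels, cat) for cat in categories)
        # Assumed combination for illustration only, not the paper's scheme.
        weights[term] = {"tf": tf, "idf": idf, "chi": chi,
                         "combined": tf * idf * chi}
    return weights


# Toy corpus (invented for this example).
docs = [
    ["goal", "match", "team"],
    ["match", "team", "win"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]
w = term_weights(docs[0], docs, labels)
```

In this toy corpus "match" occurs only in sports documents while "team" also occurs in a finance document, so "match" receives the higher chi-square score even though both terms have the same term frequency in the first document.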

