
Information bottleneck based feature selection in web text categorization
Cited by: 6
Abstract: Web text contains many feature keywords and many new words. To handle this, the paper proposes a concept-based feature extraction method for text categorization. The information bottleneck method clusters keywords according to their distributions over the class labels. Concept extraction then maps the resulting word clusters to sememes (DEF items) in HowNet, which serve as the classification features. Classification experiments on a web text corpus show that the method retains the strong robustness and low feature dimensionality of concept-based feature extraction while overcoming its main weakness: new words have no definition in the concept dictionary, so the dictionary must be continually maintained and updated.
Source: Journal of Tsinghua University (Science and Technology); indexed in EI, CAS, CSCD, Peking University Core Journals; 2010, No. 1, pp. 45-48, 53 (5 pages)
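Funding: National Natural Science Foundation of China (60673109, 60871100); Major Project of Philosophy and Social Sciences, Ministry of Education (07JZD0005); Open Fund of the National Laboratory of Pattern Recognition, Chinese Academy of Sciences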
Keywords: text categorization; feature selection; information bottleneck method
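
The keyword clustering step described in the abstract can be illustrated with a minimal sketch. The record does not say which information bottleneck variant the paper uses, so this example assumes the agglomerative form: every word starts as its own cluster, and the pair whose merge loses the least information about the class label is merged repeatedly. The merge cost is the standard one, (p(a) + p(b)) times the weighted Jensen-Shannon divergence of the two class distributions. The function names (`kl`, `merge_cost`, `cluster_words`) and the toy counts are hypothetical, and the mapping of clusters to HowNet sememes is not shown.

```python
import math


def kl(p, q):
    """KL divergence D(p || q) in bits; q must be nonzero wherever p is."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(pa, da, pb, db):
    """Class-label information lost by merging two clusters.

    pa, pb: cluster priors p(t); da, db: class distributions p(c | t).
    Cost = (pa + pb) * weighted Jensen-Shannon divergence of da and db.
    """
    total = pa + pb
    wa, wb = pa / total, pb / total
    mix = [wa * x + wb * y for x, y in zip(da, db)]
    return total * (wa * kl(da, mix) + wb * kl(db, mix))


def cluster_words(counts, n_clusters):
    """Greedy agglomerative clustering of words (assumed variant).

    counts: {word: [count in class 0, count in class 1, ...]};
    every word is assumed to occur at least once.
    Returns a list of clusters, each a list of words.
    """
    grand = sum(sum(c) for c in counts.values())
    # Each cluster is (members, prior p(t), class distribution p(c | t)).
    clusters = []
    for word, c in counts.items():
        n = sum(c)
        clusters.append(([word], n / grand, [x / n for x in c]))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest merge cost.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][1], clusters[i][2],
                                  clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, pi, di), (mj, pj, dj) = clusters[i], clusters[j]
        total = pi + pj
        merged = (mi + mj, total,
                  [(pi * x + pj * y) / total for x, y in zip(di, dj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return [members for members, _, _ in clusters]


# Toy usage: two sports-leaning words and two finance-leaning words.
toy_counts = {"goal": [9, 1], "match": [8, 2], "stock": [1, 9], "fund": [2, 8]}
print(cluster_words(toy_counts, 2))
```

Words with similar class distributions end up in the same cluster, so a new word missing from the concept dictionary can still contribute to a feature through the cluster it joins. The pairwise search is quadratic per merge, which is fine for a sketch but would likely need a cheaper scheme on a real vocabulary.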

References (10)

  • 1. Joachims T. Text categorization with support vector machines: Learning with many relevant features [C]// Proceedings of Machine Learning: ECML-98, 10th European Conference on Machine Learning. Berlin, Germany: Springer, 1998: 137-142.
  • 2. LIN Jing, CAO Defang, YUAN Chunfa. Automatic TIMEX2 tagging of Chinese temporal information [J]. Journal of Tsinghua University (Science and Technology), 2008, 48(1): 117-120. Cited by: 20
  • 3. Dumais S, Platt J, Heckerman D, et al. Inductive learning algorithms and representations for text categorization [C]// Proceedings of International Conference on Information and Knowledge Management. New York, USA: ACM Press, 1998: 148-155.
  • 4. DONG Zhendong, DONG Qiang. The download of HowNet [EB/OL]. (2008-01-01). http://www.keenage.com.
  • 5. Slonim N, Friedman N, Tishby N. Unsupervised document classification using sequential information maximization [C]// Proceedings of 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2002: 129-136.
  • 6. Baker L D, McCallum A. Distributional clustering of words for text classification [C]// Proc of the 21st Ann Int ACM SIGIR Conf on R and D in Info Retrieval. New York, USA: ACM Press, 1998: 96-103.
  • 7. Slonim N, Tishby N. Document clustering using word clusters via the information bottleneck method [C]// Proc of the 23rd Ann Int ACM SIGIR Conf on R and D in Info Retrieval. New York, USA: ACM Press, 2000: 208-215.
  • 8. Al-Mubaid H, Umair S A. A new text categorization technique using distributional clustering and learning logic [J]. IEEE Trans on Knowledge and Data Eng, 2006, 18(9): 1156-1165.
  • 9. Slonim N, Friedman N, Tishby N. Agglomerative multivariate information bottleneck [C]// Advances in Neural Information Processing Systems 14. Cambridge, MA, USA: MIT Press, 2002: 929-936.
  • 10. Sebastiani F. Machine learning in automatic text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1-47.

