Abstract
Online text is characterized by a large number of keywords and many new words. To address this, a concept-based feature selection method for text categorization is proposed. The information bottleneck method is used to cluster keywords according to their distributions over the class labels. Through concept extraction, the resulting word clusters are then mapped to HowNet sememes (DEF items), which serve as the classification features. Classification experiments on an online text corpus show that the method retains the strong robustness and low feature dimensionality of concept-based feature selection, while overcoming a weakness of such methods: new words have no definition in the concept dictionary, which otherwise has to be maintained and updated.
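The word-clustering step can be illustrated with a minimal sketch. It assumes the greedy agglomerative variant of the information bottleneck method, which the abstract does not specify, and the function names and toy input format are hypothetical. Each word starts as its own cluster with a distribution over class labels; at each step the pair whose merge loses the least information about the class label is merged. The mapping of clusters to HowNet sememes is not covered here.

```python
import math
from itertools import combinations

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(w1, p1, w2, p2):
    """Class distribution of the cluster obtained by merging two clusters."""
    total = w1 + w2
    return [(w1 * a + w2 * b) / total for a, b in zip(p1, p2)]

def merge_cost(w1, p1, w2, p2):
    """Information about the class label lost by a merge:
    (w1 + w2) times the weighted Jensen-Shannon divergence of p1 and p2."""
    m = mix(w1, p1, w2, p2)
    return w1 * kl(p1, m) + w2 * kl(p2, m)

def cluster_words(counts, k):
    """Greedily merge words into k clusters.

    counts: dict mapping word -> list of per-class occurrence counts
            (assumed format; every word needs a positive total count).
    """
    grand = sum(sum(c) for c in counts.values())
    # Each cluster is (words, prior weight p(cluster), class distribution p(class | cluster)).
    clusters = [([w], sum(c) / grand, [x / sum(c) for x in c])
                for w, c in counts.items()]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(clusters[ij[0]][1], clusters[ij[0]][2],
                                      clusters[ij[1]][1], clusters[ij[1]][2]),
        )
        words_i, w_i, p_i = clusters[i]
        words_j, w_j, p_j = clusters[j]
        merged = (words_i + words_j, w_i + w_j, mix(w_i, p_i, w_j, p_j))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(words) for words, _, _ in clusters]
```

For example, with two classes and the toy counts `{"goal": [9, 1], "match": [8, 2], "stock": [1, 9], "bond": [2, 8]}`, calling `cluster_words(counts, 2)` groups the words whose class distributions are similar. The pairwise search is quadratic in the number of clusters at every step, so this form only suits small vocabularies.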
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core
2010, Issue 1, pp. 45-48, 53 (5 pages in total)
Journal of Tsinghua University (Science and Technology)
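Funding
National Natural Science Foundation of China (60673109, 60871100)
Major Project in Philosophy and Social Sciences, Ministry of Education (07JZD0005)
Open Fund of the National Laboratory of Pattern Recognition, Chinese Academy of Sciences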
Keywords
text categorization
feature selection
information bottleneck method