Abstract
The effectiveness of text categorization depends on the quality of feature selection. Traditional feature selection methods rely on word frequency or on the relationship between features and categories; they not only ignore the semantics of the features, but most of them can also only be applied to labeled data sets. This paper proposes feature selection methods based on LDA word vectors and Word2vec word vectors, which learn the semantics of features in the topic space and from the contextual relations between words, respectively, and then perform feature selection. After feature selection, the corpus is classified with the vector space model. Experimental results on the Fudan corpus show that feature selection based on word vectors improves classification performance compared with traditional feature selection methods. Moreover, word-vector-based feature selection is an unsupervised method that requires no category labels.
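The abstract does not spell out the selection criterion, but the outline it gives is: learn word embeddings without labels, pick semantically representative terms, and then build an ordinary vector space model over those terms for classification. The sketch below illustrates such a pipeline under assumed choices: gensim's Word2Vec for the embeddings and a KMeans-centroid heuristic for picking representative features. The toy corpus, the clustering criterion, and all parameter values are illustrative assumptions, not the authors' actual method; the paper's experiments use the Fudan corpus.

# Hypothetical sketch: unsupervised feature selection with Word2vec embeddings,
# followed by a vector-space-model (TF-IDF) classifier. The clustering-based
# selection criterion is an assumption made for illustration only.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

# Toy tokenized corpus (placeholder for the real Fudan corpus).
docs = [
    ["股票", "市场", "上涨", "投资"],
    ["球队", "比赛", "胜利", "教练"],
    ["芯片", "处理器", "计算", "内存"],
]
labels = [0, 1, 2]

# 1. Learn word vectors from context windows (unsupervised, no labels needed).
w2v = Word2Vec(docs, vector_size=50, window=5, min_count=1, epochs=50, seed=1)
vocab = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in vocab])

# 2. Select features: cluster the embedding space and keep the word nearest
#    to each centroid as a semantically representative feature (assumed criterion).
k = min(6, len(vocab))
km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(vectors)
selected = set()
for c in range(k):
    members = [i for i, lbl in enumerate(km.labels_) if lbl == c]
    dists = [np.linalg.norm(vectors[i] - km.cluster_centers_[c]) for i in members]
    selected.add(vocab[members[int(np.argmin(dists))]])

# 3. Represent documents in the vector space model over the selected features
#    and train an ordinary classifier.
texts = [" ".join(d) for d in docs]
tfidf = TfidfVectorizer(vocabulary=sorted(selected), tokenizer=str.split, token_pattern=None)
X = tfidf.fit_transform(texts)
clf = LinearSVC().fit(X, labels)
print("selected features:", sorted(selected))

Because neither the embedding step nor the selection step consults document labels, the feature selection itself remains unsupervised, which is consistent with the abstract's claim that no category annotation is needed.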
Authors
CHEN Lei; LI Jun (Department of Automation, School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal (北大核心)
2018, No. 5, pp. 991-994 (4 pages)
Funding
Supported by the Industrial Internet Network Architecture Fundamental Common and Key Technology Standards Testing and Verification Project