Abstract
Traditional text classification methods assume that feature words in the training set and the test set follow the same probability distribution. In practice this assumption does not hold exactly, and the mismatch degrades the final classification results. To address this, this paper applies transfer learning: it computes the transfer volume of the feature words and uses it to amend the vector space model of the training set, so that the feature-word distributions of the training set and the test set converge. The method is applied to Chinese spam filtering and to Chinese and English web page classification, with feature transfer performed on top of CHI-statistic feature selection. Experimental results show that the proposed method can effectively eliminate the difference in feature-word distributions and markedly improves the text classification metrics.
Authors
赵世琛
王文剑
Zhao Shichen, Wang Wenjian (School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, China)
Source
《数据采集与处理》
CSCD
PKU Core (Peking University Core Journals)
2017, No. 3, pp. 516-522 (7 pages)
Journal of Data Acquisition and Processing
Funding
Supported by the National Natural Science Foundation of China (60975035, 61273291)
Supported by the Shanxi Province Research Fund for Returned Overseas Scholars (2012008)
Keywords
text categorization
transfer learning
transfer volume
vector space model