Abstract
Previous researchers have generally studied transfer learning and traditional machine learning from the standpoint of whether the formulas are well founded, while neglecting the problem as a whole. As a result, their algorithms often fail to classify thoroughly when applied to concrete text classification problems. By examining the entire text classification process, this paper uses cosine distance in the k-means algorithm, which markedly improves the experimental results. It also proposes a protection-type iteration idea, abandons the traditional word feature space in favor of a hidden space as the feature vector space, and imposes normalization constraints. Taking the CCI algorithm as an example, these ideas are combined to produce an improved algorithm, PCCI, which significantly raises the classification accuracy of transfer learning while reducing computational complexity. Tests on the 20-NewsGroups and Reuters-21578 datasets, with comparisons against other existing transfer learning algorithms, demonstrate the superiority of the improved algorithm.
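To illustrate why the choice of distance in the k-means assignment step can matter for text, here is a minimal sketch in Python. It is not the PCCI algorithm, which this abstract does not specify. The function names, the toy term-count vectors, and the centroids are all hypothetical. The sketch only shows that a long document can be pulled toward the wrong centroid by Euclidean distance, while cosine distance, which ignores vector length, groups it by direction.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cos(a, b): 0 for the same direction, 1 for orthogonal vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Ordinary straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids, dist):
    """k-means assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for p in points
    ]

# Hypothetical 2-term count vectors: axis 0 = topic A term, axis 1 = topic B term.
centroids = [[1, 0], [3, 6]]   # centroid 0 leans to topic A, centroid 1 to topic B
long_doc = [8, 2]              # a long document that is mostly about topic A

by_euclid = assign([long_doc], centroids, euclidean_distance)
by_cosine = assign([long_doc], centroids, cosine_distance)
print(by_euclid, by_cosine)
```

In this toy case the Euclidean assignment picks the topic B centroid because the document is long, whereas the cosine assignment picks the topic A centroid because the document points in nearly the same direction as it.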
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2015, Issue 23, pp. 131-138, 225 (9 pages in total)
Keywords
transfer learning
Euclidean distance
cosine distance
protection-type
normalization constraints
over dimension