Abstract
Tri-training is a disagreement-based semi-supervised learning algorithm that combines semi-supervised learning with ensemble learning. It exploits a small number of labeled samples together with a large number of unlabeled samples, improving model performance through collaboration and iteration among the base classifiers. However, when labeled samples are scarce, the initial classifiers generated by Tri-training are insufficiently trained, and the collaborative labeling among classifiers may introduce mislabeled, noisy data. To address these problems, a collaborative learning algorithm is proposed that combines DECORATE ensemble learning, diversity measurement, and credibility assessment. Building on DECORATE, the algorithm trains base classifiers with different preferences by adding differentiated artificial samples and labels, which improves generalization. It measures and selects classifiers by Jensen-Shannon (JS) divergence to maximize the diversity of the base classifiers, and during the iterations it assesses the credibility of pseudo-labeled samples with a label propagation algorithm to reduce noisy data. Classification experiments on UCI data sets show that the proposed algorithm achieves higher classification accuracy and F1-score than Tri-training and its improved variants.
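As an illustration of the diversity-measurement step, the following is a minimal sketch of a JS-divergence-based diversity score between classifiers. The abstract does not specify how the divergence is applied, so this sketch assumes it is computed between the class-probability outputs of two classifiers and averaged over samples, and that three classifiers are kept (as in Tri-training) by maximizing the summed pairwise score. The function names `js_divergence`, `pairwise_diversity`, and `most_diverse_triple` are hypothetical and do not come from the paper.

```python
from itertools import combinations

import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_diversity(proba_a, proba_b):
    """Assumed diversity score: mean per-sample JS divergence between the
    class-probability outputs of two classifiers on the same samples."""
    return float(np.mean([js_divergence(p, q) for p, q in zip(proba_a, proba_b)]))


def most_diverse_triple(probas):
    """Assumed selection rule: return the indices of the three classifiers
    whose summed pairwise diversity is largest."""
    def score(triple):
        return sum(pairwise_diversity(probas[i], probas[j])
                   for i, j in combinations(triple, 2))

    return max(combinations(range(len(probas)), 3), key=score)
```

Here `probas` would be a list with one `(n_samples, n_classes)` probability array per candidate classifier, for example the output of a `predict_proba`-style method evaluated on a shared set of samples. Summing pairwise scores is only one possible selection criterion; the paper may use a different one.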
Authors
王宇飞
陈文
WANG Yu-fei; CHEN Wen (School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China)
Source
《计算机科学》
CSCD
PKU Core (Peking University core journal list)
2022, No. 6, pp. 127-133 (7 pages)
Computer Science
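Funding
National Key Research and Development Program of China (020YFB1805405, 2019QY0800)
National Natural Science Foundation of China (U1736212, 61872255, U19A2068)
Key Laboratory of Pattern Recognition and Intelligent Information Processing, Institutions of Higher Education of Sichuan Province (MSSB-2020-01).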
Keywords
Disagreement-based semi-supervised learning
Ensemble learning
Credibility assessment
Diversity measure