Abstract
Active learning reduces labeling effort by selecting, at each training step, the most informative samples for an expert to label. The representativeness and the uncertainty of a sample are both important measures of its informativeness, and considering them together gives better overall results. However, existing ways of combining the two have a number of problems, which limits the adaptability of the resulting algorithms. To address this, this paper proposes a robust partial-dependency active learning classification algorithm based on different sample attributes. By introducing a partial-dependency weight coefficient function, the algorithm considers representativeness and uncertainty jointly while still being able to emphasize a particular characteristic of the samples. Because the sample representativeness model changes gradually, sample selection separates the learning stages of representative and uncertain samples: early training relies mainly on representative samples and later training mainly on uncertain samples, which greatly improves the adaptability of the algorithm. Simulation experiments on datasets from the UCI Machine Learning Repository indicate that the approach is sound and feasible: on the datasets used, the proposed method needs fewer labeled samples than the comparison algorithms to reach the same classification accuracy.
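The selection idea described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: the abstract does not give the weight coefficient function, so the linear schedule in `uncertainty_weight`, the Gaussian-kernel density used for representativeness, and the normalized entropy used for uncertainty are all hypothetical choices, as are the function names.

```python
import numpy as np

def entropy_uncertainty(proba):
    """Shannon entropy of each row of class probabilities, scaled to [0, 1].

    Assumes at least two classes (columns).
    """
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1) / np.log(proba.shape[1])

def density_representativeness(X, bandwidth=1.0):
    """Mean Gaussian-kernel similarity of each sample to every pool sample."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1)

def uncertainty_weight(step, total_steps):
    """Hypothetical schedule: 0 at the first query, rising linearly to 1 at the last."""
    return min(max(step / max(total_steps - 1, 1), 0.0), 1.0)

def select_query(X_pool, proba, step, total_steps):
    """Return the index of the pool sample with the highest combined score."""
    w = uncertainty_weight(step, total_steps)
    score = ((1.0 - w) * density_representativeness(X_pool)
             + w * entropy_uncertainty(proba))
    return int(np.argmax(score))
```

With this schedule, early queries favor samples in dense regions of the unlabeled pool and late queries favor samples the current classifier is least sure about. Any monotone schedule could be substituted for the linear one.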
Source
Journal of Yanshan University (《燕山大学学报》)
CAS
2011, No. 1, pp. 74-80 (7 pages)
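Funding
Natural Science Foundation of Hebei Province (F2008000891, F2010001297)
China Postdoctoral Science Foundation (20080440124)
China Postdoctoral Science Foundation, Special Funding, Second Batch (200902356)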
Keywords
active learning
partial dependency
sample representativeness
sample uncertainty
classification