Abstract
Active learning query strategies select, from unlabeled data, the examples most likely to improve a classifier's performance, thereby reducing manual labeling cost. Query strategies based on expected loss minimization serve this purpose, but they suffer from high computational complexity, and random sampling performs unstably. Because information entropy is a strong measure of how informative an unlabeled example is, this paper proposes a statistical learning query strategy based on information entropy sampling estimation. The strategy uses a model trained on the labeled examples to compute the information entropy of every example in the unlabeled pool, selects the several examples with the highest uncertainty, computes the expected empirical risk of the corresponding data distribution for each of them, and queries the label of the example that minimizes the expected empirical risk. An empirical study was carried out on public UCI machine learning datasets (including tic-tac-toe, transfusion, kr-vs-kp, diagnosis, and breast-cancer) with different labeling ratios (such as 20%, 40%, 60%, 80%, and 100%) and different classifiers (such as random forest and logistic regression). The results show that, compared with the random sampling strategy, the proposed strategy reduces computational complexity from O(N²) to O(Q×N) and improves accuracy by up to 6% in the best case.
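The selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Q is taken to be the number of shortlisted high-entropy candidates and N the pool size; `CentroidSoftmax` is a hypothetical toy classifier standing in for the random forest or logistic regression models used in the paper; and the expected empirical risk is approximated here by the expected 0/1 loss over the pool. The names `entropy`, `CentroidSoftmax`, and `select_query` are illustrative only.

```python
import numpy as np


def entropy(proba):
    """Shannon entropy (natural log) of each row of a probability matrix."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)


class CentroidSoftmax:
    """Toy stand-in classifier (an assumption, not the paper's model):
    softmax over negative squared distances to the class centroids."""

    def fit(self, X, y, n_classes):
        # Every class must appear at least once in y.
        self.centroids = np.stack(
            [X[y == c].mean(axis=0) for c in range(n_classes)]
        )
        return self

    def predict_proba(self, X):
        d = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        z = -d - (-d).max(axis=1, keepdims=True)  # stabilise the softmax
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)


def select_query(X_lab, y_lab, X_pool, n_classes, q):
    """Return the pool index to label next.

    Step 1: score every pool example by predictive entropy.
    Step 2: shortlist the q most uncertain examples.
    Step 3: for each shortlisted example, estimate the expected risk on the
            pool after adding it with each possible label, weighting each
            label by the current model's predicted probability.
    Step 4: return the candidate with the lowest expected risk.
    """
    model = CentroidSoftmax().fit(X_lab, y_lab, n_classes)
    proba = model.predict_proba(X_pool)
    candidates = np.argsort(entropy(proba))[-q:]

    best_idx, best_risk = -1, np.inf
    for i in candidates:
        risk = 0.0
        for c in range(n_classes):
            X_new = np.vstack([X_lab, X_pool[i]])
            y_new = np.append(y_lab, c)
            p_new = (
                CentroidSoftmax()
                .fit(X_new, y_new, n_classes)
                .predict_proba(X_pool)
            )
            # Expected 0/1 loss on the pool under the retrained model.
            risk += proba[i, c] * (1.0 - p_new.max(axis=1)).mean()
        if risk < best_risk:
            best_idx, best_risk = int(i), risk
    return best_idx
```

Calling `select_query(X_lab, y_lab, X_pool, n_classes=2, q=5)` returns one index into `X_pool`. Because only the q shortlisted candidates trigger retraining and a pass over the pool, the work per query grows with Q×N rather than with every pair of pool examples, which is the kind of saving the abstract reports.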
Authors
曲豫宾
陈翔
QU Yu-bin; CHEN Xiang (Jiangsu College of Engineering and Technology, Nantong 226007, China; Department of Information Science and Technology, Nantong University, Nantong 226019, China)
Source
《通化师范学院学报》
2019, No. 12, pp. 66-72 (7 pages)
Journal of Tonghua Normal University
Funding
Nantong Municipal Science and Technology Project (JC2018134)
Keywords
information entropy
active learning
statistical learning