Abstract
Active learning is an important research direction in machine learning. Existing active learning methods typically select uncertain or representative samples for an expert to label and then add them to the labeled data set for the classifier to learn from. However, these methods do not make full use of the data distribution information, and their handling of outlier selection leaves room for improvement. Drawing on neighborhood rough set theory, this paper proposes an active learning method based on neighborhood rough sets (neighbor rough set active learning, NRS-AL). Experimental results on University of California, Irvine (UCI) data sets show that the algorithm makes full use of the data distribution information, combines the uncertainty and representativeness of samples, and handles the selection of outliers. It is an effective approach to the sample selection problem in active learning and outperforms the active learning algorithms in the literature in terms of accuracy and the area under the receiver operating characteristic (ROC) curve (AUC).
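The abstract does not give the NRS-AL formulas, so the following is only a minimal illustrative sketch of pool-based sample selection that combines uncertainty and representativeness and skips outliers. It is not the authors' algorithm. The function names, the entropy-based uncertainty score, the neighborhood radius `delta`, and the `min_neighbors` threshold are all assumptions introduced here for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (assumed uncertainty score)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def neighborhood(i, points, delta):
    """Indices of the other points within Euclidean distance delta of points[i]."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= delta]

def select_query(pool, probs, delta, min_neighbors=1):
    """Return the index of the pool sample to send to the expert, or None.

    Hypothetical score: entropy (uncertainty) times the fraction of the pool
    that lies in the sample's delta-neighborhood (representativeness).
    Samples with fewer than min_neighbors neighbors are treated as outliers
    and are never selected.
    """
    best, best_score = None, -1.0
    for i in range(len(pool)):
        n = len(neighborhood(i, pool, delta))
        if n < min_neighbors:
            continue  # isolated point: assumed to be an outlier
        score = entropy(probs[i]) * n / len(pool)
        if score > best_score:
            best, best_score = i, score
    return best

# Toy pool: three clustered points and one isolated point.
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
# Hypothetical class probabilities from the current classifier.
probs = [(0.9, 0.1), (0.5, 0.5), (0.6, 0.4), (0.5, 0.5)]
print(select_query(pool, probs, delta=0.2))  # 1
```

In this toy pool the isolated point is as uncertain as sample 1, but it has an empty neighborhood, so the sketch passes over it and queries the uncertain point inside the cluster instead.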
Source
Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition)
CSCD
PKU Core Journal
2017, No. 6, pp. 776-784 (9 pages)
Funding
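National Natural Science Foundation of China (61309014)
Humanities and Social Sciences Planning Project of the Ministry of Education (15XJA630003)
Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ1500416)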
Chongqing Basic and Frontier Research Program (cstc2013jcyjA40063)
Keywords
neighborhood rough set
active learning
pool-based sample selection