Abstract
In the era of data science, it is common to train machine learning algorithms on a given dataset, which may be collected prospectively through surveys or scientific experiments. It has recently been recognized that a merely representative training dataset is not enough: if the trained system is to handle less common categories well, the dataset must include enough examples from those categories. This is the dataset coverage problem. Building on existing methods for the dataset coverage problem and drawing on ideas from association rule mining algorithms, this paper proposes an optimized algorithm for obtaining MUPs that improves running efficiency. It also proposes approaches for two difficulties faced by the coverage computation algorithm: data sparsity, and bitmaps that grow too large for the available memory. Finally, theoretical analysis and comprehensive experiments on real datasets verify the superiority of the optimized MUP algorithm.
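As a rough illustration of the terms in the abstract, the sketch below computes pattern coverage with bitmaps and enumerates MUPs by brute force. It assumes that MUP stands for "maximal uncovered pattern" as used in the dataset coverage literature, that attributes are categorical, and that a pattern is covered when at least `threshold` rows match it. The names `build_bitmaps`, `coverage`, `parents`, and `naive_mups` are hypothetical. This is the naive baseline, not the optimized algorithm the paper proposes.

```python
from itertools import product

WILDCARD = None  # marks an unspecified attribute; assumes None is not a data value


def build_bitmaps(rows):
    """Map (attribute index, value) to an integer bitmap of matching row ids."""
    bitmaps = {}
    for row_id, row in enumerate(rows):
        for attr, value in enumerate(row):
            key = (attr, value)
            bitmaps[key] = bitmaps.get(key, 0) | (1 << row_id)
    return bitmaps


def coverage(pattern, bitmaps, n_rows):
    """Count the rows that match every specified attribute of the pattern."""
    mask = (1 << n_rows) - 1  # the all-wildcard pattern matches every row
    for attr, value in enumerate(pattern):
        if value is not WILDCARD:
            mask &= bitmaps.get((attr, value), 0)
    return bin(mask).count("1")


def parents(pattern):
    """Yield patterns obtained by replacing one specified value with a wildcard."""
    for i, value in enumerate(pattern):
        if value is not WILDCARD:
            yield pattern[:i] + (WILDCARD,) + pattern[i + 1:]


def naive_mups(rows, domains, threshold):
    """Brute force: keep uncovered patterns whose parents are all covered."""
    bitmaps = build_bitmaps(rows)
    n_rows = len(rows)
    result = []
    for pattern in product(*[[WILDCARD] + list(d) for d in domains]):
        if coverage(pattern, bitmaps, n_rows) >= threshold:
            continue  # covered, so it cannot be an MUP
        if all(coverage(p, bitmaps, n_rows) >= threshold for p in parents(pattern)):
            result.append(pattern)
    return result


# Toy dataset with two binary attributes; the combination (1, 1) never occurs.
rows = [(0, 0), (0, 1), (1, 0)]
domains = [[0, 1], [0, 1]]
mups = naive_mups(rows, domains, threshold=1)
```

The brute-force loop visits every pattern, so its cost grows exponentially with the number of attributes. One bitmap per attribute value also becomes large on big or sparse datasets, which is the memory concern the abstract mentions.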
Author
刘荣鑫 (LIU Rongxin), School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2020, No. 6, pp. 79-85 (7 pages)