Abstract
Incremental learning is an effective and efficient technique for mining large-scale data. Incremental partial least squares (IPLS), an improvement of partial least squares (PLS) based on incremental learning, offers competitive dimension-reduction performance. However, IPLS must update the model every time a single new sample arrives, so training samples are learned one by one, which makes online learning time-consuming. To overcome this problem, we propose an extension of IPLS called chunk incremental partial least squares (CIPLS), which partitions the training data into chunks and updates the model one chunk at a time, greatly reducing the update frequency and improving learning efficiency. Comparative experiments on the k8 version of the p53 cancer rescue mutants data set and the Reuters-21578 text classification corpus show that CIPLS is substantially more efficient than IPLS without sacrificing dimension-reduction performance.
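The efficiency gain described in the abstract comes purely from reducing how often the incremental update step runs. The following minimal sketch (not the authors' implementation; the `incremental_updates` counter is a hypothetical stand-in for the actual CIPLS model-update step) illustrates how processing samples in chunks cuts the number of update calls from one per sample to one per chunk:

```python
def incremental_updates(n_samples, chunk_size=1):
    """Count model-update calls when samples arrive in chunks.

    chunk_size=1 mimics IPLS (one update per sample); a larger
    chunk_size mimics CIPLS (one update per chunk of samples).
    """
    updates = 0
    for start in range(0, n_samples, chunk_size):
        # one incremental model update covers the whole chunk
        updates += 1
    return updates

# Per-sample updating (IPLS-style): 10000 update calls.
assert incremental_updates(10_000, chunk_size=1) == 10_000
# Chunk-based updating (CIPLS-style) with chunks of 50: 200 calls.
assert incremental_updates(10_000, chunk_size=50) == 200
```

Each CIPLS update is more expensive than a single-sample IPLS update, but the fifty-fold drop in update frequency in this example is what dominates the overall training time on large data.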
Authors
ZENG Xue-qiang
YE Zhen-lin
ZUO Jia-li
WAN Zhong-ying
WU Shui-xiu
ZENG Xue-qiang;YE Zhen-lin;ZUO Jia-li;WAN Zhong-ying;WU Shui-xiu(Information Engineering School, Nanchang University, Nanchang 330031, Jiangxi, China;School of Computer & Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China)
Source
Journal of Shandong University (Natural Science)
Indexed in: CAS, CSCD, PKU Core Journals (北大核心)
2019, No. 3, pp. 93-101 (9 pages)
Funding
National Natural Science Foundation of China (61463033, 61866017)
Jiangxi Provincial Outstanding Young Talents Program (20171BCB23013)
Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ150354)
Keywords
incremental learning
partial least squares
data chunk
dimension reduction