Abstract
In deep learning, deeper Convolutional Neural Networks (CNNs) require more training data. Gene structural variation, however, is a small-sample event in large-scale genetic data, so image data for variant genes is scarce. This scarcity seriously degrades CNN training and leads to low detection precision and a high false-positive rate for gene structural variation. To increase the number of structural variation samples and improve the precision with which a CNN identifies such variation, a gene image data augmentation method based on the Generative Adversarial Network (GAN), named GeneGAN, is proposed. First, initial gene image data is generated with the Reads stacking method and split into two datasets: variant gene images and non-variant gene images. Second, GeneGAN is used to augment the variant image samples so that the positive and negative datasets are balanced. Finally, a CNN performs detection on the datasets before and after balancing, and precision, recall, and F1 score are compared. Experimental results show that, compared with traditional augmentation methods, GAN-based augmentation methods, and feature extraction methods, GeneGAN improves the F1 score of gene structural variation detection by 1.94 to 17.46 percentage points, indicating that generating gene data with GeneGAN effectively improves the precision of CNN-based gene image classification.
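The evaluation step in the abstract compares precision, recall, and F1 score before and after the dataset is balanced, and reports the gain in percentage points. The following is a minimal sketch of that arithmetic only, not the authors' code: the function names, the `Scores` container, and all counts used with them are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Precision, recall and F1 for the positive (variant) class."""
    precision: float
    recall: float
    f1: float


def classification_scores(tp: int, fp: int, fn: int) -> Scores:
    """Compute the three metrics from confusion counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


def f1_gain_in_points(before: Scores, after: Scores) -> float:
    """Difference in F1 expressed in percentage points."""
    return (after.f1 - before.f1) * 100.0


def samples_needed_to_balance(n_variant: int, n_non_variant: int) -> int:
    """Number of synthetic variant images needed to match the negative set."""
    return max(0, n_non_variant - n_variant)
```

With made-up counts, `classification_scores(60, 40, 40)` and `classification_scores(80, 20, 20)` could stand for a classifier trained before and after balancing, and `f1_gain_in_points` would then express the difference on the same scale the abstract uses. These numbers are illustrative and are not results from the paper.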
Authors
曹一珉
蔡磊
高敬阳
CAO Yimin; CAI Lei; GAO Jingyang (College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China)
Source
《计算机应用》
CSCD
Peking University Core Journals (北大核心)
2022, Issue 3, pp. 783-790 (8 pages)
Journal of Computer Applications
Funding
Beijing Natural Science Foundation (5182018).
Keywords
Generative Adversarial Network (GAN)
residual learning
gene image
Convolutional Neural Network (CNN)
data augmentation