Abstract
Microorganisms directly affect human health, and analyzing microbial data can support disease diagnosis. The collected data, however, suffer from two problems: class imbalance and high sparsity. Existing oversampling methods alleviate class imbalance to some extent, but they cope poorly with the high sparsity of microbial data. This paper proposes a data augmentation algorithm that combines matrix factorization with cost-sensitive learning and consists of three techniques. First, the original matrix is decomposed into a sample subspace and a feature subspace. Second, synthetic vectors are generated from the positive vectors in the sample subspace and their neighboring vectors. Finally, the synthetic vectors are filtered according to their distances from all negative vectors. Experiments on 8 microbial datasets compare the proposed algorithm with 5 oversampling algorithms. The results show that the proposed algorithm increases the diversity of positive samples and identifies more positive samples while incurring a lower classification cost.
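The three steps described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the factorization method, the interpolation rule, or the filtering criterion, so the code assumes a truncated SVD as a stand-in for the matrix factorization, SMOTE-style linear interpolation between a positive vector and one of its nearest positive neighbors, and a fixed distance threshold to the nearest negative vector. All function names and parameters (`factorize`, `synthesize`, `filter_by_negatives`, `augment`, `rank`, `k`, `min_dist`) are hypothetical.

```python
import numpy as np


def factorize(X, rank):
    """Assumed stand-in: truncated SVD so that X is approximately U @ V.

    U (n_samples x rank) plays the role of the sample subspace,
    V (rank x n_features) the role of the feature subspace.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]


def synthesize(pos, n_new, k, rng):
    """Interpolate between positive vectors and their k nearest positive neighbors."""
    k = min(k, len(pos) - 1)
    # Pairwise Euclidean distances among positive vectors.
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a vector is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(pos), size=n_new)
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return pos[base] + gap * (pos[nb] - pos[base])


def filter_by_negatives(synth, neg, min_dist):
    """Assumed rule: keep synthetic vectors at least min_dist from every negative."""
    d = np.linalg.norm(synth[:, None, :] - neg[None, :, :], axis=2)
    return synth[d.min(axis=1) >= min_dist]


def augment(X, y, rank=2, n_new=20, k=3, min_dist=0.5, seed=0):
    """Run the three steps; y uses 1 for positive and 0 for negative samples."""
    rng = np.random.default_rng(seed)
    U, V = factorize(X, rank)
    synth = synthesize(U[y == 1], n_new, k, rng)
    kept = filter_by_negatives(synth, U[y == 0], min_dist)
    return U, V, kept
```

Under these assumptions, the retained vectors live in the sample subspace; one plausible way to obtain feature-space samples would be `kept @ V`, although the abstract does not state how (or whether) the authors map synthetic vectors back.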
Authors
WANG Xi (王曦)
WEN Liuying (温柳英)
MIN Fan (闵帆)
School of Computer Science, Southwest Petroleum University, Chengdu 610500, China
Source
Journal of Data Acquisition and Processing (《数据采集与处理》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2023, Issue 2, pp. 401-412 (12 pages)
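Funding
Central Government Funds for Guiding Local Scientific and Technological Development (中央引导地方科技发展专项项目, 2021ZYD0003)
Southwest Petroleum University Qihang Program (西南石油大学启航计划, 2018QHR007)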