Data-driven machine learning, especially deep learning technology, is becoming an important tool for handling big data issues in bioinformatics. In machine learning, DNA sequences are often converted to numerical valu...Data-driven machine learning, especially deep learning technology, is becoming an important tool for handling big data issues in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. Similar conversion occurs in Genomic Signal Processing(GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is also called encoding scheme. The diverse encoding schemes can greatly affect the performance of GSP applications and machine learning models. This paper aims to collect,analyze, discuss, and summarize the existing encoding schemes of genome sequence particularly in GSP as well as other genome analysis applications to provide a comprehensive reference for the genomic data representation and feature learning in machine learning.展开更多
针对句子分类任务常面临着训练数据不足,而且文本语言具有离散性,在语义保留的条件下进行数据增强具有一定困难,语义一致性和多样性难以平衡的问题,本文提出一种惩罚生成式预训练语言模型的数据增强方法(punishing generative pre-train...针对句子分类任务常面临着训练数据不足,而且文本语言具有离散性,在语义保留的条件下进行数据增强具有一定困难,语义一致性和多样性难以平衡的问题,本文提出一种惩罚生成式预训练语言模型的数据增强方法(punishing generative pre-trained transformer for data augmentation,PunishGPT-DA)。设计了惩罚项和超参数α,与负对数似然损失函数共同作用微调GPT-2(generative pre-training 2.0),鼓励模型关注那些预测概率较小但仍然合理的输出;使用基于双向编码器表征模型(bidirectional encoder representation from transformers,BERT)的过滤器过滤语义偏差较大的生成样本。本文方法实现了对训练集16倍扩充,与GPT-2相比,在意图识别、问题分类以及情感分析3个任务上的准确率分别提升了1.1%、4.9%和8.7%。实验结果表明,本文提出的方法能够同时有效地控制一致性和多样性需求,提升下游任务模型的训练性能。展开更多
基金supports from the Department of Computing Sciences, State University of New York College at Brockport
文摘Data-driven machine learning, especially deep learning technology, is becoming an important tool for handling big data issues in bioinformatics. In machine learning, DNA sequences are often converted to numerical values for data representation and feature learning in various applications. Similar conversion occurs in Genomic Signal Processing(GSP), where genome sequences are transformed into numerical sequences for signal extraction and recognition. This kind of conversion is also called encoding scheme. The diverse encoding schemes can greatly affect the performance of GSP applications and machine learning models. This paper aims to collect,analyze, discuss, and summarize the existing encoding schemes of genome sequence particularly in GSP as well as other genome analysis applications to provide a comprehensive reference for the genomic data representation and feature learning in machine learning.
文摘针对句子分类任务常面临着训练数据不足,而且文本语言具有离散性,在语义保留的条件下进行数据增强具有一定困难,语义一致性和多样性难以平衡的问题,本文提出一种惩罚生成式预训练语言模型的数据增强方法(punishing generative pre-trained transformer for data augmentation,PunishGPT-DA)。设计了惩罚项和超参数α,与负对数似然损失函数共同作用微调GPT-2(generative pre-training 2.0),鼓励模型关注那些预测概率较小但仍然合理的输出;使用基于双向编码器表征模型(bidirectional encoder representation from transformers,BERT)的过滤器过滤语义偏差较大的生成样本。本文方法实现了对训练集16倍扩充,与GPT-2相比,在意图识别、问题分类以及情感分析3个任务上的准确率分别提升了1.1%、4.9%和8.7%。实验结果表明,本文提出的方法能够同时有效地控制一致性和多样性需求,提升下游任务模型的训练性能。