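When existing deep-learning-based word embedding methods are applied to Web anomaly detection, words that do not appear in the corpus (out-of-vocabulary, OOV) are usually mapped to an unknown token and assigned a zero or random vector before being fed to the model for training. This ignores the context in which the unknown word occurs within the Web request. At the same time, when developing a Web system, programmers tend to design request paths that follow certain patterns, both out of personal habit and to improve code readability. Taking these request patterns and the semantic correlation between words into account, this work studies DUWe (Dynamic Unknown Word Embedding), a Word2vec-based dynamic representation method for unknown words that derives the vector of an unknown word from the context of the words in the Web request path. Experimental evaluation on the CSIC-2010 and WAF Dataset datasets shows that adding the unknown-word representation outperforms static feature extraction with Word2vec alone: accuracy, precision, recall, and F1-score all improve, and training time is reduced by up to a factor of 1.14.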
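The idea of giving an unknown word a vector derived from its context, rather than a zero or random vector, can be illustrated with a minimal sketch. The abstract does not specify how DUWe combines the context, so the averaging rule, the window size, the function name `embed_with_context`, and the toy vectors below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Sequence


def embed_with_context(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],
    dim: int,
    window: int = 2,
) -> List[List[float]]:
    """Return one vector per token.

    In-vocabulary tokens keep their stored vector. An out-of-vocabulary
    token gets the mean of the in-vocabulary neighbours inside the window
    (an assumed rule), and falls back to a zero vector only when no
    neighbour is known.
    """
    embedded = []
    for i, token in enumerate(tokens):
        if token in vectors:
            embedded.append(list(vectors[token]))
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        neighbours = [
            vectors[t]
            for j, t in enumerate(tokens[lo:hi], start=lo)
            if j != i and t in vectors
        ]
        if neighbours:
            # Component-wise mean over the known neighbours.
            embedded.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
        else:
            embedded.append([0.0] * dim)
    return embedded


# Hypothetical 2-dimensional vectors standing in for a trained Word2vec model.
toy_vectors = {"tienda": [1.0, 0.0], "publico": [0.0, 1.0], "jsp": [1.0, 1.0]}
# Tokens of a hypothetical request path; "carrito" is the unknown word.
path_tokens = ["tienda", "publico", "carrito", "jsp"]
embedded = embed_with_context(path_tokens, toy_vectors, dim=2)
```

Here the unknown token receives the mean of its three known neighbours instead of a zero vector, so tokens that appear in similar path positions end up with similar representations.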
The dynamics of RNA molecules are closely related to their functions. Flexibility, one of the most fundamental characteristics of RNA dynamics, has been widely used to study folding properties, structural stability, ligand-binding ability, and more. Experimental methods for measuring RNA flexibility are often time-consuming and labor-intensive, so a fast and accurate theoretical method for predicting RNA flexibility is urgently needed. To this end, we propose RNAfwe, a machine learning method that predicts RNA flexibility using word embedding to extract features from the RNA sequence. RNAfwe is compared with RNAflex, a similar sequence-based method. Compared with RNAflex (One-Hot), which uses one-hot encoding, RNAfwe obtains higher Pearson correlation coefficients (PCC) on both the training and test sets, 0.5017 and 0.4704 respectively, indicating that word embedding extracts features more relevant to flexibility from RNA sequences than one-hot encoding does. Compared with RNAflex (PSSM), which uses evolutionary information, RNAfwe performs slightly worse, but RNAflex (PSSM) requires a sufficient number of known homologous sequences. This work contributes to the study of RNA dynamic properties and supports the wider use of word embedding in bioinformatics research.
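Two pieces of this pipeline are easy to make concrete: turning a sequence into "words" that an embedding model can consume, and scoring predictions with the Pearson correlation coefficient. The abstract does not say how RNAfwe tokenizes sequences, so the overlapping k-mer scheme and the value `k=3` below are assumptions; the PCC function follows the standard definition.

```python
import math
from typing import List, Sequence


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (an assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("PCC is undefined for a constant sample")
    return cov / (norm_x * norm_y)


words = kmers("AUGGCU")  # toy sequence, not from the paper's dataset
perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Each k-mer plays the role of a word in a sentence, so a Word2vec-style model can be trained on sequences in the same way as on text, and the PCC then measures how well predicted per-nucleotide flexibility tracks the reference values.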
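To address problems in maize breeding text data such as overlapping triples and the many ways in which entities are expressed, this work proposes a joint entity and relation extraction method for maize breeding based on BERT-CRF (Bidirectional encoder representations from transformers-conditional random field) with embedded lexical information. First, the expression characteristics of the maize breeding corpus are analyzed, and a tagging strategy is adopted that annotates entity boundaries, relation categories, and entity position information simultaneously. Second, a BERT-CRF model with embedded lexical information is built for training and prediction: a maize breeding knowledge dictionary is constructed, lexical information is embedded into BERT so that character features and word features are fused to strengthen the model's semantic capability, and the CRF layer outputs the globally optimal tag sequence. An entity and relation triple matching algorithm (ERTM) then matches and maps the tags to obtain triples. Finally, to verify the effectiveness of the method, experiments are carried out on a maize breeding dataset. The proposed model achieves precision, recall, and F1 of 91.84%, 95.84%, and 93.80% respectively, improving on existing models. The results show that the method can effectively extract knowledge in the maize breeding domain and provides a data foundation for building a maize breeding knowledge graph and for other downstream tasks.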
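The three reported metrics are tied together by the definition of F1 as the harmonic mean of precision and recall, which gives a quick consistency check on the figures above. The helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.9184, 0.9584)
print(f"{f1:.2%}")  # prints 93.80%
```

The computed value agrees with the reported F1 of 93.80%, so the three numbers are mutually consistent.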