An integrated algorithm based on deep neural network and regular expression patterns to recognize gene mutation entities in biomedical literature
Abstract

Objective: Accurate automatic recognition of gene mutation mentions is an essential basis for mining gene-variant-disease relations from biomedical literature. We propose a combined algorithm for recognizing gene mutation named entities, built mainly on a deep neural network and supplemented by Viterbi decoding and regular expressions.

Methods: Inspired by distributed word representations, we propose a deep tokenization strategy: each word is split by letter case, digits, and special characters so as to capture the structural information carried by each part of a mutation name, and the smallest unit produced by this split is defined as a token. Token vectors are trained with GloVe, and all token vectors of a word are used to train that word's vector. Taking the word-vector sequence of a sentence as input, a bidirectional long short-term memory (Bi-LSTM) network learns the general form of mutation names and captures context information. A fully connected layer follows to improve fitting capability, yielding a sequence of per-word label probabilities as the preliminary output. The Viterbi algorithm then optimizes this preliminary output, and finally the results of regular expression matching are added to complete recognition.

Results: Trained and tested on the NCBI tmVar corpus, the algorithm achieved an F1-score of 91.59%, higher than that of the recognition systems publicly reported at the time.

Conclusion: The algorithm avoids complicated manual feature engineering while showing strong performance. It can be used to quickly locate mutation entities in biomedical text and lays a foundation for further research on relation extraction.
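The deep tokenization step described in the abstract can be illustrated with a minimal sketch. The abstract only states that words are split by letter case, digits, and special characters. The exact rule below (runs of uppercase letters, runs of lowercase letters, runs of digits, and each other non-space character as its own token) and the name `deep_tokenize` are assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List

# Assumed splitting rule (not taken from the paper):
#   - a run of uppercase letters is one token
#   - a run of lowercase letters is one token
#   - a run of digits is one token
#   - every other non-whitespace character is a token by itself
_TOKEN_PATTERN = re.compile(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9\s]")


def deep_tokenize(word: str) -> List[str]:
    """Split a word into tokens by letter case, digits, and special characters."""
    return _TOKEN_PATTERN.findall(word)


if __name__ == "__main__":
    # Two mutation mentions written in common HGVS-like notation.
    for mention in ["c.123A>G", "p.Val600Glu"]:
        print(mention, deep_tokenize(mention))
```

Under this assumed rule, a three-letter amino acid code such as `Val` is split into `V` and `al`, because the case changes inside the word. Whether the published system treats such cases the same way cannot be determined from the abstract.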
Authors: LUO Zhe-heng; TONG Fan; ZHAO Dong-sheng (Information Center, Academy of Military Medical Sciences, Academy of Military Sciences, Beijing 100850, China)
Source: Military Medical Sciences (《军事医学》), 2018, No. 11, pp. 872-876 (5 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Key Research and Development Program of China (2016YFC0901900)
Keywords: gene mutation; named entity recognition; deep neural network (computer); word distributed representation
