Abstract
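Objective Accurate automatic recognition of gene mutation named entities is an essential foundation for mining gene-variant-disease relation knowledge from the biomedical literature. This paper proposes a combined algorithm for recognizing gene mutation named entities, based primarily on a deep neural network and supplemented with Viterbi decoding and regular expressions. Methods Inspired by distributed word representations, a deep tokenization strategy is proposed: each word is split by letter case, digits, and special symbols to capture the structural information of each part of a mutation name, with the smallest unit of segmentation defined as a token. GloVe is used to train vectors for the tokens produced by deep tokenization, and all token vectors of a word are used to train that word's vector. Taking a sentence's sequence of word vectors as input, a bidirectional long short-term memory network (Bi-LSTM) learns the general form of mutation names and captures contextual information; it is followed by a fully connected layer to improve fitting capability, yielding a sequence of label probabilities for the words as the preliminary output. The Viterbi algorithm is then applied to optimize the preliminary output, and finally the results of regular expression matching are added to complete the recognition. Results Trained and tested on the NCBI tmVar corpus, the algorithm achieved an F1 score of 91.59%, higher than that of the recognition systems publicly reported internationally to date. Conclusion The algorithm avoids complex manual feature engineering and shows superior performance; it can be used to quickly locate mutation entities in biomedical text and lays a foundation for further research on relation extraction.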
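The deep tokenization step described above can be illustrated with a minimal sketch. The abstract states only that words are split by letter case, digits, and special symbols; the exact rules here (grouping consecutive characters of the same class into one token, and emitting each special symbol as its own token) are assumptions, and the name `deep_tokenize` is hypothetical rather than taken from the paper.

```python
import re
from typing import List

# Assumed token classes (not specified in the abstract):
# a run of uppercase letters, a run of lowercase letters,
# a run of digits, or a single non-alphanumeric, non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9\s]")


def deep_tokenize(word: str) -> List[str]:
    """Split one word into tokens by letter case, digits, and symbols."""
    return TOKEN_PATTERN.findall(word)


if __name__ == "__main__":
    for mention in ["c.123A>G", "p.Val600Glu", "V600E"]:
        print(mention, deep_tokenize(mention))
```

Under these assumed rules, a mention such as `V600E` separates into the reference residue, the position, and the substituted residue, which is the kind of structural information the token vectors are meant to capture.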
Objective Automatic recognition of mutation mentions plays a fundamental and critical role in mining gene-variant-disease relation knowledge from biomedical literature. Methods In this paper, we propose a combined algorithm for mutation mention detection, which consists of a deep neural network, Viterbi decoding, and regular expressions. Inspired by the distributed representation of words, we divide each word by letter case, numbers, and special characters into tokens and train a token embedding that captures nomenclature features of mutations. In the network, bi-directional long short-term memory (Bi-LSTM) layers learn a general form of mutation mentions while capturing long-term context information, and fully connected layers improve the fitting capability. The input of the network is the sequence of word vectors, which are trained from the token embeddings. The output of the network is decoded by the Viterbi algorithm to optimize the initial label sequence. In addition, regular expression patterns are used to label mutation mentions, which provides extra information to refine the initial output. Results Trained and tested on the NCBI tmVar mutation corpus, our algorithm achieved an F1-score of 91.59%, which is higher than that of currently reported systems. Conclusion This algorithm performs well without complicated manual feature engineering; it can be used to rapidly locate mutation entities in biomedical literature and to facilitate further research on relation extraction.
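As an illustration of the Viterbi decoding step, the following is a generic sketch of how a sequence of per-word label scores can be decoded into the single best label path. It is not the authors' implementation: the abstract does not give the label set or the transition scores, so the O/B/I labels, the forbidden O-to-I transition, and all numbers below are assumptions chosen for the example.

```python
import math
from typing import List


def viterbi(emissions: List[List[float]],
            transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label path.

    emissions[t][k]   : log-score of label k at position t
    transitions[i][j] : log-score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])          # best score ending in each label
    backpointers = []                    # best previous label per step
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j]
                              + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    best = max(range(n_labels), key=lambda k: scores[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


if __name__ == "__main__":
    O, B, I = 0, 1, 2                    # assumed label set
    NEG_INF = float("-inf")
    # Assumed constraint: an inside tag may not directly follow O.
    transitions = [[0.0, 0.0, NEG_INF],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]]
    probs = [[0.8, 0.1, 0.1],
             [0.1, 0.4, 0.5],
             [0.1, 0.1, 0.8]]
    emissions = [[math.log(p) for p in row] for row in probs]
    print(viterbi(emissions, transitions))
```

In this toy input, picking the most probable label independently at each position would give O, I, I, which violates the assumed constraint; decoding over the whole sequence selects O, B, I instead.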
Authors
LUO Zhe-heng (罗哲恒)
TONG Fan (佟凡)
ZHAO Dong-sheng (赵东升)
LUO Zhe-heng; TONG Fan; ZHAO Dong-sheng (Information Center, Academy of Military Medical Sciences, Academy of Military Sciences, Beijing 100850, China)
Source
《军事医学》
CAS
CSCD
PKU Core (Peking University Core Journals)
2018, Issue 11, pp. 872-876 (5 pages)
Military Medical Sciences
Funding
Supported by the National Key Research and Development Program of China (2016YFC0901900)