Abstract
Biomedical event extraction is an important branch of biomedical text mining, and trigger detection, a key subtask of event extraction, has attracted considerable attention. Most existing trigger detection methods are one-stage methods based on shallow machine learning; they are costly to train and depend on rich domain knowledge to engineer large numbers of features by hand. This paper proposes a two-stage trigger detection method based on bidirectional Long Short-Term Memory (BLSTM) networks. First, trigger detection is divided into a recognition stage and a classification stage, which effectively alleviates the class imbalance problem during training. Second, both stages use a BLSTM network, for the binary recognition task and the multi-class classification task respectively, which avoids the cost of manual feature extraction incurred by shallow machine learning methods. In addition, dependency-based word embeddings are trained on a large-scale corpus downloaded from the PubMed database to capture richer semantic information, which effectively improves trigger detection performance. On the Multi-Level Event Extraction (MLEE) corpus, the method achieves an F-score of 78.46%, outperforming the state-of-the-art systems it is compared against.
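The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the BLSTM models are replaced by toy lexicon lookups, and all names (`split_training_labels`, `two_stage_predict`, `TOY_LEXICON`) as well as the event type labels are hypothetical and chosen only for illustration. The sketch shows how one-stage token labels split into a binary recognition task and a multi-class classification task that no longer contains the dominant non-trigger class.

```python
from typing import Callable, List, Sequence, Tuple

NON_TRIGGER = "None"  # assumed label for tokens that are not triggers


def split_training_labels(labels: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Derive per-stage training labels from one-stage token labels.

    Stage 1 sees a binary problem (trigger vs. non-trigger).
    Stage 2 sees only trigger tokens, so the majority non-trigger
    class is absent from the multi-class problem.
    """
    stage1 = [NON_TRIGGER if y == NON_TRIGGER else "Trigger" for y in labels]
    stage2 = [y for y in labels if y != NON_TRIGGER]
    return stage1, stage2


def two_stage_predict(
    tokens: Sequence[str],
    is_trigger: Callable[[Sequence[str], int], bool],
    classify: Callable[[Sequence[str], int], str],
) -> List[str]:
    """Run recognition first, then classify only the recognized candidates."""
    return [
        classify(tokens, i) if is_trigger(tokens, i) else NON_TRIGGER
        for i in range(len(tokens))
    ]


# Toy stand-ins for the two models (assumption: a real system would use
# two trained BLSTM networks here instead of a lexicon lookup).
TOY_LEXICON = {"induces": "Positive_regulation", "growth": "Growth"}


def toy_is_trigger(tokens: Sequence[str], i: int) -> bool:
    return tokens[i].lower() in TOY_LEXICON


def toy_classify(tokens: Sequence[str], i: int) -> str:
    return TOY_LEXICON[tokens[i].lower()]


tokens = ["VEGF", "induces", "tumor", "growth"]
predicted = two_stage_predict(tokens, toy_is_trigger, toy_classify)
stage1_labels, stage2_labels = split_training_labels(predicted)
print(predicted)
print(stage1_labels, stage2_labels)
```

Because the second stage is only ever asked about tokens that passed the first, its label set excludes the non-trigger class, which is one way to read the abstract's claim about relieving class imbalance.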
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2017, No. 6, pp. 147-154 (8 pages)
Funding
National Natural Science Foundation of China (61672126)
Keywords
trigger detection
two-stage method
bidirectional LSTM
dependency word embeddings