Abstract
Identifying the core entities in a biomedical document is a prerequisite for accurately extracting information from it. To address the limitations of current entity recognition and core entity screening methods for biomedical literature, this paper proposes an LSTM-based model for extracting biomedical core entities. The model is built around LSTM: it improves the model input through better word vectors and input generation rules, improves processing with a bidirectional LSTM, and handles the output by saving the results in a tree structure and pruning that tree to obtain the label chain. Entity recognition reaches an F1 value of 89.35%. In addition, core entity screening follows TF/IDF algorithm rules and takes full account of factors such as word frequency, position, and inverse document frequency, reaching an F1 value of 76.85%.
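The abstract does not give the screening formula, so the following is only a minimal sketch of how word frequency, position, and inverse document frequency could be combined into one score. The function names `score_entities` and `top_k`, the `title_boost` weight, and the smoothed IDF form are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def score_entities(doc_entities, corpus_entities, title_boost=2.0):
    """Score the entities of one document (hypothetical scheme).

    doc_entities: list of (entity, in_title) pairs, one per mention.
    corpus_entities: list of sets, the entities found in each corpus document.
    title_boost: assumed weight for a mention located in the title.
    """
    n_docs = len(corpus_entities)

    # Position-weighted term frequency: title mentions count more.
    weighted_tf = Counter()
    for entity, in_title in doc_entities:
        weighted_tf[entity] += title_boost if in_title else 1.0
    total = sum(weighted_tf.values())

    scores = {}
    for entity, wtf in weighted_tf.items():
        # Document frequency: number of corpus documents containing the entity.
        df = sum(1 for ents in corpus_entities if entity in ents)
        # Smoothed IDF (an assumed variant) keeps the value positive.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[entity] = (wtf / total) * idf
    return scores


def top_k(scores, k):
    """Return the k highest-scoring entities, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity for entity, _ in ranked[:k]]
```

In this sketch an entity that is frequent in the document, appears in the title, and is rare across the corpus ranks highest, while an entity that occurs in every document is pushed down by its low IDF.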
Authors
唐颖
曹春萍
TANG Ying; CAO Chun-ping (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
《软件导刊》
2018, No. 5, pp. 132-137 (6 pages)
Software Guide
Funding
National Natural Science Foundation of China (61402288)
Keywords
entity recognition
improved word vector
bidirectional LSTM
pruning strategy
core entity screening