Abstract
Identifying abbreviations in biomedical literature is important for all text mining tools. To address abbreviation recognition for biological terms, a search algorithm based on reverse-order text alignment is proposed; it is simple to implement and does not require large amounts of training data. The algorithm achieves 91% precision and 93% recall on the Medstract gold-standard corpus, and 96% precision and 84% recall on SBQTL, a larger test set containing 128 full-text articles. After a detailed analysis of the errors in the experimental results, integrating natural language processing techniques such as text preprocessing and grammatical rules into the alignment-based search algorithm is proposed as the direction for future work.
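The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of one generic way to do right-to-left (reverse) alignment between a parenthesized short form and the text preceding it. It is not necessarily the authors' exact method. The function names `find_long_form` and `extract_pairs`, the parenthesis pattern, and the window-size rule are all assumptions made for illustration.

```python
import re

# Assumption: abbreviations appear as "long form (SF)" with a short, unnested parenthetical.
PAREN = re.compile(r"\(([^()]{2,10})\)")


def find_long_form(short_form, candidate):
    """Align short_form against candidate from right to left.

    Each alphanumeric character of the short form must be found, in order,
    while scanning the candidate text backwards. The first character of the
    short form must additionally sit at the start of a word.
    Returns the matched long form, or None if the alignment fails.
    """
    letters = [c.lower() for c in short_form if c.isalnum()]
    if not letters:
        return None
    l_idx = len(candidate) - 1
    for i in range(len(letters) - 1, -1, -1):
        ch = letters[i]
        # Move left until the character matches (and, for the first
        # short-form character, until the match is word-initial).
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (i == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    return candidate[l_idx + 1:].strip()


def extract_pairs(sentence):
    """Return (short form, long form) pairs found in one sentence."""
    pairs = []
    for m in PAREN.finditer(sentence):
        short = m.group(1).strip()
        if " " in short or not any(c.isalpha() for c in short):
            continue
        words_before = sentence[:m.start()].split()
        n = sum(c.isalnum() for c in short)
        # Assumed heuristic: limit the search window to a few words per letter.
        window = min(n + 5, n * 2)
        candidate = " ".join(words_before[-window:])
        long_form = find_long_form(short, candidate)
        if long_form and len(long_form) > len(short):
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Expression of heat shock protein (HSP) was measured."))
# prints [('HSP', 'heat shock protein')]
```

Because the scan needs only the sentence itself, a heuristic of this kind requires no training data, which is consistent with the property claimed in the abstract. Its typical failure cases (letters matched inside the wrong word, long forms outside the window) are the sort of errors that preprocessing and grammatical rules could plausibly reduce.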
Source
《武汉理工大学学报(信息与管理工程版)》
CAS
2014, Issue 5, pp. 592-595, 604 (5 pages in total)
Journal of Wuhan University of Technology: Information & Management Engineering
Funding
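Overseas Scholars Research Fund of the Heilongjiang Provincial Department of Education (1253HQ001)
Doctoral Research Start-up Fund of Northeast Agricultural University (2012RCB54)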
Keywords
text mining
text alignment
abbreviation recognition
biomedical literature mining