Abstract
Information extraction from biomedical literature is important for making full use of the results achieved in the biomedical field and for promoting further progress in biology and medicine. This paper addresses the analysis and understanding of biomedical abbreviations and proposes a disambiguation algorithm based on K-nearest neighbor (K-NN) with weighted voting. The algorithm automatically generates labeled instances based on the "One Sense Per Discourse" hypothesis, uses global feature words that express the topic of a text as disambiguation features, and classifies with weighted-voting K-NN. Experiments on a real corpus of 177,762 Medline abstracts show that the proposed algorithm clearly outperforms the algorithms in related work. The experiments also show that, for abbreviation disambiguation, weighted-voting K-NN achieves not only higher prediction accuracy than classical K-NN but also more stable performance.
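The classification step can be illustrated with a minimal sketch. The abstract does not state the weighting scheme or the similarity measure, so the version below assumes that each of the k nearest neighbors votes with a weight equal to its cosine similarity to the query over bag-of-words topic features. The function names `cosine` and `weighted_knn` and the toy contexts for the abbreviation "PCR" are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def weighted_knn(train, query_words, k=3):
    """Pick the sense with the largest summed similarity among the k nearest neighbors.

    train: list of (context_words, sense) pairs.
    query_words: context words of the document containing the abbreviation.
    """
    query = Counter(query_words)
    scored = [(cosine(Counter(words), query), sense) for words, sense in train]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for similarity, sense in nearest:
        # Assumption: the vote weight is the similarity itself.
        votes[sense] += similarity
    return max(votes, key=votes.get)


# Invented toy training contexts for "PCR"; illustrative only.
train = [
    (["dna", "amplification", "primer", "gene"], "polymerase chain reaction"),
    (["primer", "template", "dna", "cycle"], "polymerase chain reaction"),
    (["tumor", "chemotherapy", "response", "breast"], "pathologic complete response"),
    (["tumor", "response", "surgery", "patient"], "pathologic complete response"),
]

print(weighted_knn(train, ["dna", "primer", "gene"], k=3))
```

With similarity-weighted votes, a distant neighbor contributes little to the decision, which is one plausible reason a weighted scheme would be less sensitive to the choice of k than simple majority voting.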
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2008, No. 2, pp. 18-23 (6 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (Grant No. 90409007)
Keywords
computer application
Chinese information processing
biomedical information extraction
abbreviation disambiguation
K-NN with weighted voting