Abstract
The publication of gene and protein sequences and the progress of molecular biology research have caused the related data to grow exponentially. Extracting information relevant to a biologist's research area directly from this large body of literature has therefore become an urgent need, and recognizing named entities (NEs) such as protein, gene, and DNA names in biological literature is the most basic information extraction task in bioinformatics. This paper surveys the methods used for biological named entity recognition in international research, and then focuses on protein name recognition: the methods, the resources used, the experimental results, and a comparison with comparable international work, against which the results are promising.
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2007, No. 1, pp. 100-102 (3 pages)
Funding
National Natural Science Foundation of China (Grant No. 60302021)
Keywords
Bioinformatics
Named Entity Recognition
Machine Learning
Feature Selection