Abstract
In current term extraction experiments on an open bioinformatics corpus, precision for the top 2,000-plus two-character words has reached 90.36%, but precision for words of three or more characters is only 66.63%. Extracting multi-character words has therefore become a difficult problem in automatic noun term extraction. To address it, this paper proposes a term recognition strategy that combines the C-value parameter, which is strong at extracting long terms, with the mutual information parameter used in term extraction. Experimental results show that for long terms the precision is 75.7%, the recall is 68.4%, and the F-measure is 71.9%, all higher than those of other methods on the same corpus.
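The abstract names two scores, C-value and mutual information, without giving formulas. The sketch below shows their commonly cited textbook definitions; it is an illustration, not the paper's actual method. It assumes candidates are tuples of tokens with known corpus frequencies, it uses pointwise mutual information for the mutual information score, and it leaves out how the two scores are combined because the abstract does not say. All function names and the toy counts are hypothetical.

```python
import math


def is_nested(short, long):
    """Return True if `short` occurs as a contiguous subsequence of `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freq):
    """Compute the standard C-value for each candidate term.

    freq: dict mapping a candidate (tuple of tokens) to its corpus frequency.
    Assumption: every longer candidate containing `a` is itself a key of `freq`.
    """
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates in which `a` appears nested.
        containers = [b for b in freq if len(b) > len(a) and is_nested(a, b)]
        if containers:
            # Discount by the mean frequency of the containing candidates.
            mean_nested = sum(freq[b] for b in containers) / len(containers)
            scores[a] = math.log2(len(a)) * (f_a - mean_nested)
        else:
            scores[a] = math.log2(len(a)) * f_a
    return scores


def pmi(x_count, y_count, xy_count, total):
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))) from raw counts."""
    return math.log2((xy_count * total) / (x_count * y_count))


# Toy counts, invented for illustration only.
toy_freq = {
    ("gene", "expression"): 10,
    ("gene", "expression", "regulation"): 4,
}
toy_scores = c_value(toy_freq)
print(toy_scores[("gene", "expression")])  # 6.0
print(pmi(8, 8, 4, 64))                    # 2.0
```

The `log2(len(a))` factor is what favors longer candidates, which matches the abstract's remark that C-value is strong on long terms; mutual information instead measures how tightly the parts of a candidate co-occur.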
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2010, Issue 4, pp. 108-110 (3 pages)
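Funding
Jiangsu Province Engineering Technology Research and Development Project on Application Support Software for Modern Enterprise Informatization (SX200907)
Heilongjiang Province Postdoctoral Fund (520415029)
Jiangsu Province "Qing Lan" Project (2008)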
Keywords
Term extraction
C-value
Mutual information