Abstract
Most existing methods for recognizing organization name abbreviations depend on the corresponding full names and on the compositional form of the abbreviation. To address both limitations, this paper proposes a recognition method based on context feature matching. The context features are divided into single features, which occur only with organization names, and intersecting features, which are shared by organization names and noise words. Each feature is assigned an error rate as its weight, and a context feature matching algorithm recognizes abbreviations using the features within a chosen error-rate range. A noise word list and an extension operation further improve precision and recall. In experiments, the method reaches an F-value of 92.23% on the closed data set; using the features and noise words trained on the closed set, it obtains an F-value of 70.28% on the open test set. Finally, compared with a recognition method that generates abbreviations from full names, the proposed method recognizes abbreviations both with and without a matching full name and achieves better results.
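The context-feature matching idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the abstract does not give the feature representation, the error-rate values, or the extension operation, so the `ContextFeature` class, the `recognize` function, the one-token context window, and all example words and error rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextFeature:
    """Hypothetical representation of one context feature."""
    word: str          # context word seen next to a candidate
    position: str      # "left" or "right": side of the candidate the word is on
    error_rate: float  # assumed share of training matches that were not organizations


def recognize(tokens, features, noise_words, max_error_rate):
    """Return tokens whose adjacent context matches a usable feature.

    Assumptions: input is already word-segmented, the context window is a
    single neighbouring token, and noise words are simply skipped.
    """
    # Keep only features whose error rate lies within the chosen range.
    usable = {(f.word, f.position) for f in features
              if f.error_rate <= max_error_rate}
    found = []
    for i, token in enumerate(tokens):
        if token in noise_words:
            continue  # known noise word: never reported as an abbreviation
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        if (left, "left") in usable or (right, "right") in usable:
            found.append(token)
    return found


# Made-up features: a low-error "single" feature and a high-error
# "intersecting" feature that also co-occurs with non-organization words.
features = [
    ContextFeature("校长", "right", 0.02),
    ContextFeature("访问", "left", 0.40),
]
tokens = ["复旦", "校长", "访问", "交大"]

strict = recognize(tokens, features, set(), max_error_rate=0.05)
loose = recognize(tokens, features, set(), max_error_rate=0.50)
noisy = recognize(["访问", "期间"], features, set(), max_error_rate=0.50)
filtered = recognize(["访问", "期间"], features, {"期间"}, max_error_rate=0.50)
```

In this toy setting, a strict error-rate threshold keeps only the low-error feature, while a looser threshold admits the intersecting feature, which finds more candidates but also matches noise words unless they are on the noise word list.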
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal (北大核心)
2015, Issue 7, pp. 1432-1437 (6 pages)
Journal of Chinese Computer Systems
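Funding
Supported by the Major Project of the Shanghai Science and Technology Commission (12dz1500205)
Supported by the Shanghai International Cooperation Project (13430710100)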
Keywords
organization abbreviations (机构名简称)
context features (上下文特征)
intersecting features (相交特征)
single features (独有特征)
feature matching algorithm (特征匹配算法)
noise words (干扰词)