Abstract
Based on an analysis of the characteristics of terms in Chinese patent documents on new energy vehicles, a term extraction method is proposed that uses a conditional random field (CRF) model with character sequence labeling under three-, four-, and six-position tag sets. Taking single characters as the segmentation granularity avoids term recognition errors caused by word segmentation, and the effect of the different position tag sets on extraction performance is examined. Experimental results show that the CRF model with six-position character tagging performs best: its precision, recall, and F-measure exceed those of the comparison methods that use words, part of speech, word length, and similar information as features, which verifies the effectiveness of the proposed method.
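The character-level labeling described in the abstract can be illustrated with a small sketch. The abstract does not list the actual tag sets, so the ones below are assumptions borrowed from common practice in Chinese sequence labeling: B/I/O for three positions, B/M/E/O for four, and B/B2/B3/M/E/O for six. The function names, the example sentence, and the term spans are hypothetical as well. The resulting tag sequences are the kind of per-character labels a CRF toolkit would be trained on.

```python
# Sketch only: the tag inventories are assumed, not taken from the paper.
# 3 positions: B (term start), I (term inside), O (outside any term)
# 4 positions: B, M (middle), E (term end), O
# 6 positions: B, B2 (2nd char), B3 (3rd char), M, E, O


def position_tag(offset, length, scheme):
    """Tag for the character at `offset` inside a term of `length` characters."""
    if offset == 0:
        return "B"
    if scheme == 3:
        return "I"
    if offset == length - 1:
        return "E"
    if scheme == 6 and offset == 1:
        return "B2"
    if scheme == 6 and offset == 2:
        return "B3"
    return "M"


def tag_characters(sentence, term_spans, scheme):
    """Return one tag per character; term_spans holds half-open (start, end) pairs."""
    if scheme not in (3, 4, 6):
        raise ValueError("scheme must be 3, 4, or 6")
    tags = ["O"] * len(sentence)
    for start, end in term_spans:
        for offset in range(end - start):
            tags[start + offset] = position_tag(offset, end - start, scheme)
    return tags


if __name__ == "__main__":
    sentence = "混合动力汽车的电池"   # hypothetical example sentence
    spans = [(0, 6), (7, 9)]          # hypothetical terms: 混合动力汽车, 电池
    for scheme in (3, 4, 6):
        print(scheme, list(zip(sentence, tag_characters(sentence, spans, scheme))))
```

Under these assumptions, the six-position set differs from the smaller ones only in marking the second and third characters of a term separately, which gives the model more information about the prefix of long terms.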
Authors
王健
殷旭
吕学强
徐丽萍
WANG Jian; YIN Xu; LYU Xue-qiang; XU Li-ping (Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; Beijing Research Center of Urban System Engineering, Beijing 100089, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2019, Issue 1, pp. 279-284 (6 pages)
Computer Engineering and Design
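Funding
National Natural Science Foundation of China (61671070)
Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016003)
National Social Science Fund of China, Major Project (14@ZH036)
State Language Commission of China, Key Project (ZDI135-53)
State Language Commission of China, Major Research Project (ZDA125-26)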