Abstract
[Objective] This paper explores the optimal conditions for a model that extracts Chinese patent terms in the metallurgy domain, so that such terms can be extracted effectively. [Methods] Using a still-incomplete core corpus and no manual annotation, we built a character-role-labeling model based on conditional random fields (CRFs) to identify Chinese metallurgy patent terms. We describe how the model was built and compare how individual CRFs factors (feature combinations, character window length, and others) affect recognition performance. [Results] With the combination of character sequence, level features, domain features, and temperature features, a character window length of 3, and parameters c = 1 and f = 1, the model reached a precision of 94.26%, a recall of 94.37%, and an F1 value of 94.5%. [Limitations] The core dictionary is incomplete, so some words were labeled inaccurately. The model was not compared in detail with other methods, and the reliability of CRFs was not examined in detail. [Conclusions] With an appropriate combination of roles, features, and feature templates, CRFs can identify Chinese metallurgy patent terms well.
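As a rough illustration of the character-role labeling and window features mentioned in the abstract, the following Python sketch tags each character by dictionary lookup and collects the characters in a window around a position. The B/I/E/S/O tag set, the longest-match lookup, the function names, the sample lexicon, and the reading of "window length 3" as three characters centered on the current one are all assumptions made for illustration; the abstract does not specify them, and the paper's actual roles and feature templates may differ.

```python
from typing import Dict, List


def label_characters(sentence: str, lexicon: List[str]) -> List[str]:
    """Tag each character with an assumed B/I/E/S/O role by longest-match lexicon lookup."""
    tags = ["O"] * len(sentence)
    # Try longer terms first so the longest match wins at each position.
    terms = sorted((t for t in lexicon if t), key=len, reverse=True)
    i = 0
    while i < len(sentence):
        match = next((t for t in terms if sentence.startswith(t, i)), None)
        if match is None:
            i += 1
            continue
        n = len(match)
        if n == 1:
            tags[i] = "S"  # single-character term
        else:
            tags[i] = "B"  # first character of a term
            for j in range(i + 1, i + n - 1):
                tags[j] = "I"  # interior characters
            tags[i + n - 1] = "E"  # last character of a term
        i += n
    return tags


def window_features(sentence: str, index: int, window: int = 3) -> Dict[str, str]:
    """Collect the characters in a window centered on `index` (assumed interpretation)."""
    half = window // 2
    feats = {}
    for offset in range(-half, half + 1):
        pos = index + offset
        # Positions outside the sentence are padded with a placeholder symbol.
        feats[f"char[{offset}]"] = sentence[pos] if 0 <= pos < len(sentence) else "<PAD>"
    return feats


# Hypothetical example sentence and lexicon, not taken from the paper's corpus.
sentence = "高炉炼铁工艺"
lexicon = ["高炉", "炼铁"]
tags = label_characters(sentence, lexicon)
features = window_features(sentence, 0)
```

Tags produced this way could serve as training labels without manual annotation, and the per-character window features correspond loosely to what a CRF feature template would expand; domain-specific features such as level or temperature indicators would be added as further columns.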
Source
《现代图书情报技术》 (New Technology of Library and Information Service)
CSSCI
2016, No. 6, pp. 28-36 (9 pages)
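Funding
One of the research outcomes of the Jiangsu Provincial Natural Science Foundation project "Research on Chinese Ontology Learning for Patent Early Warning" (Grant No. BK20130587)
and the Jiangsu Province "333" Project "Research on Chinese Ontology Learning for Knowledge Services" (Grant No. BRA2015401)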
Keywords
Chinese patent terminology
Conditional random fields (CRFs)
Terminology extraction
Sequence labeling