摘要
跨语系术语对齐质量不高,原因在于其依赖于低质量的术语抽取与对齐.提出的多策略融合Giza++(AGiza)的术语对齐法,为提高术语抽取质量,用首尾词性规则提高召回率,用独立过滤、停用过滤提高准确率,再识别共句术语对.为提高术语对齐的对准率:基于独立度、停用度,提出独立相关度、停用相关度;由种子对相关度和单词关联度概率加组合成语义相关度;根据首尾对齐情况,提出首尾相关度,并去除值为0者;基于词性组成特征,构造词性相似度;由GIZA++计算得到g值;经过属性的相关系数分析后,乘法组合各属性构造术语对齐度a;最后,过滤a超过术语对齐阈值(由召回率设定)的术语对.实验结果表明,AGiza术语对齐,可有效地处理跨语系术语对齐,质量高于GIZA++,Dice,Φ2,LLR,K-VEC及DKVEC.
The quality of cross-phylum term alignment depends on the quality of term extraction and alignment method. This paper proposes an automatic term alignment based on advanced multi-strategy and Giza++ (AGiza) integration. By analyzing the properties of the term extraction performed by using some existing methodologies in the literature, the rules of the first and the last part of speech of strings are designed to increase the recall rate. Methods that are applied for the purpose of increasing the precision of the term extraction include: (1) independence filter; (2) stopping filter; and (3) recognition of the co-occurrence of terms in the sentence pairs. The following steps are also implemmented to increase the alignment quality: (1) design the degree of the independence correspondence based on the degree of independence; (2) construct the degree of the stopping correspondence based on the degree of stopping usage; (3) propose the degree of semantic correspondence that computed by the seed pairs' correspondence and word pairs' similarity based on additivity of probability; (4) construct the alignment correspondence degree of the first part and last part between the term pairs in order to cancel the term pairs whose value is equal to zero; (5) present the similarity degree of the part of speech between the term pairs considering the patterns that define the morphosyntactic structures of terms; and (6) obtain the value ofg based on GIZA++. The term-aligned degree (a) is computed by the six attributes of term pairs based on multiplication of probability after analyzing their correlations. Term pairs is extracted by select the term-aligned pairs based on the candidate term pairs whose a is more than the term-aligned threshold that make the tolerance of recall is less than 1%. The simulation results of Chinese-English term alignment show that automatic term alignment based on AGiza can be used to extract cross-phylum term pairs effectively. Furthermore, it outperforms GIZA++, the Dice coefficient, the Φ2 coefficient, the log-likelihood ratio, K-VEC and DKVEC.
出处
《软件学报》
EI
CSCD
北大核心
2015年第7期1650-1661,共12页
Journal of Software
基金
国防基础科学研究计划(Q172011A001)
关键词
术语对齐
多语言术语抽取
跨语言
跨语系
term alignment
multilingual term extraction
cross-language
cross-phylum