期刊文献+

基于多策略融合Giza++的术语对齐法 被引量:4

Automatic Term Alignment Based on Advanced Multi-Strategy and Giza++ Integration
下载PDF
导出
摘要 跨语系术语对齐质量不高,原因在于其依赖于低质量的术语抽取与对齐.提出的多策略融合Giza++(AGiza)的术语对齐法,为提高术语抽取质量,用首尾词性规则提高召回率,用独立过滤、停用过滤提高准确率,再识别共句术语对.为提高术语对齐的对准率:基于独立度、停用度,提出独立相关度、停用相关度;由种子对相关度和单词关联度概率加组合成语义相关度;根据首尾对齐情况,提出首尾相关度,并去除值为0者;基于词性组成特征,构造词性相似度;由GIZA++计算得到g值;经过属性的相关系数分析后,乘法组合各属性构造术语对齐度a;最后,过滤a超过术语对齐阈值(由召回率设定)的术语对.实验结果表明,AGiza术语对齐,可有效地处理跨语系术语对齐,质量高于GIZA++,Dice,Φ2,LLR,K-VEC及DKVEC. The quality of cross-phylum term alignment depends on the quality of term extraction and alignment method. This paper proposes an automatic term alignment based on advanced multi-strategy and Giza++ (AGiza) integration. By analyzing the properties of the term extraction performed by using some existing methodologies in the literature, the rules of the first and the last part of speech of strings are designed to increase the recall rate. Methods that are applied for the purpose of increasing the precision of the term extraction include: (1) independence filter; (2) stopping filter; and (3) recognition of the co-occurrence of terms in the sentence pairs. The following steps are also implemmented to increase the alignment quality: (1) design the degree of the independence correspondence based on the degree of independence; (2) construct the degree of the stopping correspondence based on the degree of stopping usage; (3) propose the degree of semantic correspondence that computed by the seed pairs' correspondence and word pairs' similarity based on additivity of probability; (4) construct the alignment correspondence degree of the first part and last part between the term pairs in order to cancel the term pairs whose value is equal to zero; (5) present the similarity degree of the part of speech between the term pairs considering the patterns that define the morphosyntactic structures of terms; and (6) obtain the value ofg based on GIZA++. The term-aligned degree (a) is computed by the six attributes of term pairs based on multiplication of probability after analyzing their correlations. Term pairs is extracted by select the term-aligned pairs based on the candidate term pairs whose a is more than the term-aligned threshold that make the tolerance of recall is less than 1%. The simulation results of Chinese-English term alignment show that automatic term alignment based on AGiza can be used to extract cross-phylum term pairs effectively. Furthermore, it outperforms GIZA++, the Dice coefficient, the Φ2 coefficient, the log-likelihood ratio, K-VEC and DKVEC.
出处 《软件学报》 EI CSCD 北大核心 2015年第7期1650-1661,共12页 Journal of Software
基金 国防基础科学研究计划(Q172011A001)
关键词 术语对齐 多语言术语抽取 跨语言 跨语系 term alignment multilingual term extraction cross-language cross-phylum
  • 相关文献

参考文献8

二级参考文献96

  • 1吕学强,吴宏林,姚天顺.无双语词典的英汉词对齐[J].计算机学报,2004,27(8):1036-1045. 被引量:11
  • 2张威,周昌乐.汉语隐喻理解的逻辑描述初探[J].中文信息学报,2004,18(5):23-28. 被引量:18
  • 3吕学强,张乐,黄志丹,胡俊峰.基于散列技术的快速子串归并算法[J].复旦学报(自然科学版),2004,43(5):948-951. 被引量:4
  • 4张锋,樊孝忠,许云.Chinese Term Extraction Based on PAT Tree[J].Journal of Beijing Institute of Technology,2006,15(2):162-166. 被引量:2
  • 5王斌.汉语语料库自动对齐研究(博士学位论文)[M].北京:中国科学院计算技术研究所,1999..
  • 6Oakes M P,Paice C D.Term extraction for automatic abstracting[M] //Bourigault D,Jacquemin C,L'Homme M-C.Recent Advances in Computational Terminology.John Benjamins Publishing Company,2001:353-370.
  • 7Fortuna B,Lavrac N,Velardi P.Advancing Topic Ontology Learning through Term Extraction[C].PRICAI 2008,LNAI 5351,2008:626-635.
  • 8Cerbah F,Euzenat J.Using Terminology Extraction to Improve Traceability from Formal Models to Textual Requirements[C].NLDB 2000,LNCS 1959,2001:115-126.
  • 9Bourigault D.Surface Grammatical Analysis for the Extraction of Terminological Noun Phrases[C] //Proceedings of COLING'92,1992:977-981.
  • 10Frantzi K T,Ananiadou S,Mima H.Automatic Recognition of Multi-word terms:the C-value/NC-value Method[J].International Journal on Digital Libraries,2000,3(2):115-130.

共引文献79

同被引文献40

引证文献4

二级引证文献11

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部