Automatic extraction and alignment of multiword expressions from English-Chinese comparable corpus

Original title: 英中可比语料库中多词表达自动提取与对齐

Cited by: 12

Abstract: Multiword expressions (MWEs) are important not only for improving the quality of current machine translation (MT) systems but also for other natural language processing tasks such as cross-language information retrieval and data mining. This paper proposes a method that combines semantic templates with a statistical tool to automatically extract native English MWEs from a three-tuple comparable corpus. Thesaurus-based and distributional methods are used to compute the similarity between words, which widens MWE coverage. GIZA++ is run to align words at the sentence level and obtain Chinese MWE candidates. For each native English MWE, all Chinese candidates are collected and ranked by their co-occurrence affinity (a statistically estimated translation probability), and only the top-ranked candidate is accepted as the Chinese translation of that English MWE. Experimental results show that the proposed method effectively improves the accuracy of MWE extraction and alignment.
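The candidate-selection step described in the abstract can be sketched as follows. This is a minimal illustration only: it assumes the word-alignment output has already been reduced to (English MWE, Chinese candidate) pairs, and it uses relative frequency as a stand-in for the co-occurrence affinity, whose exact definition is not given in this record. The function name `best_translations` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def best_translations(aligned_pairs):
    """Pick the top-scoring Chinese candidate for each English MWE.

    aligned_pairs: iterable of (english_mwe, chinese_candidate) tuples,
    one per sentence pair in which word alignment linked the two.
    Returns {english_mwe: (chinese_candidate, relative_frequency)}.
    """
    counts = defaultdict(Counter)
    for en, zh in aligned_pairs:
        counts[en][zh] += 1

    best = {}
    for en, candidates in counts.items():
        total = sum(candidates.values())
        # Highest count wins; ties are broken by string order so the
        # result is deterministic.
        zh, n = min(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
        # Assumption: relative frequency approximates the translation
        # probability used for ranking.
        best[en] = (zh, n / total)
    return best


# Hypothetical toy data, not taken from the paper.
pairs = [
    ("machine translation", "机器翻译"),
    ("machine translation", "机器翻译"),
    ("machine translation", "机械翻译"),
    ("data mining", "数据挖掘"),
]
result = best_translations(pairs)
```

Keeping only the single best candidate per English MWE mirrors the "top one" rule stated in the abstract; a real pipeline would likely add a minimum-count or minimum-probability threshold, which is not specified here.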

Authors: Xiao Jian (肖健), Xu Jian (徐建), Xu Xiaolan (徐晓兰), Yuan Qi (袁琦)

Affiliation: China Center for Information Industry Development (CCID, 中国电子信息产业发展研究院)

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2010, No. 31, pp. 130-134, 187 (6 pages in total)

Funding: National Natural Science Foundation of China, No. 60872118

Keywords: three-tuple comparable corpus; multiword expressions (MWE); semantic template

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

