Word2vec Based Word Alignment Corpus for the Greater China Region (基于word2vec的大中华区词对齐库的构建)

Cited by: 6


Abstract: This paper analyzes a linguistic phenomenon across China's Mainland, Hong Kong and Taiwan (hereafter the Greater China Region, GCR): the same meaning is often expressed with different words in each region. First, we crawl 3.2 million GCR parallel sentence pairs from Wikipedia and from news websites published in simplified and traditional Chinese, and manually annotate a GCR parallel word alignment corpus of 10,000 pairs with an annotation agreement above 95%. We then propose a two-phase GCR word alignment model based on word2vec: it uses word2vec to obtain vector representations of GCR words and combines them with a cosine similarity measure and post-processing techniques. Experimental results show that, on the word alignment corpora of both genres (Wikipedia and news), the proposed model achieves significantly higher F1 scores than the existing GIZA++ and HMM-based baseline models. In addition, we apply the model to Wikipedia to generate 90,029 GCR word triples with an accuracy of 82.66%.
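The abstract names word vectors and cosine similarity as the core of the alignment model but gives no procedural detail. The following is only a minimal sketch of that general idea, not the authors' two-phase model: the function names `cosine` and `align`, the greedy one-to-one matching, the 0.5 threshold, and the hand-made three-dimensional vectors are all assumptions made for illustration. It also assumes that words from both variants live in one shared embedding space.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def align(src_words, tgt_words, vectors, threshold=0.5):
    """Greedy one-to-one alignment (illustrative assumption, not the paper's method).

    For each source word, pick the unused target word with the highest
    cosine similarity, provided it exceeds `threshold`.
    Returns a list of (source_index, target_index) pairs.
    """
    alignment = []
    used = set()
    for i, s in enumerate(src_words):
        best_j, best_sim = None, threshold
        for j, t in enumerate(tgt_words):
            # Skip targets already aligned and words without a vector.
            if j in used or s not in vectors or t not in vectors:
                continue
            sim = cosine(vectors[s], vectors[t])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            alignment.append((i, best_j))
    return alignment

# Toy vectors, invented for this example; real ones would come from word2vec.
vectors = {
    "软件": [0.9, 0.1, 0.0],  # Mainland term for "software"
    "軟體": [0.8, 0.2, 0.1],  # Taiwan term for "software"
    "信息": [0.1, 0.9, 0.2],  # Mainland term for "information"
    "資訊": [0.0, 0.8, 0.3],  # Taiwan term for "information"
}
src = ["信息", "软件"]
tgt = ["資訊", "軟體"]
pairs = align(src, tgt, vectors)
print([(src[i], tgt[j]) for i, j in pairs])
```

With these toy vectors each Mainland term pairs with its Taiwan counterpart, while a mismatched pair such as 信息 and 軟體 falls below the threshold and stays unaligned. A greedy pass is the simplest choice here; the abstract's post-processing step would refine such raw matches, but its details are not given in this record.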

Authors: Wang Mingwen (王明文), Xu Xiongfei (徐雄飞), Xu Fan (徐凡), Li Maoxi (李茂西)

Affiliation: School of Computer Information Engineering, Jiangxi Normal University

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 5, pp. 76-83 (8 pages)

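Funding: National Natural Science Foundation of China (61462045, 61402208, 61462044); State Language Commission "Twelfth Five-Year" Plan (YB125-99); Natural Science Foundation of Jiangxi Province (20132BAB201030, 20151BAB207027, 20151BAB207025)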

Keywords: the Greater China Region; word alignment; longest common subsequence; word2vec

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]




