Original title: 融合主题模型及双语词向量的汉缅双语可比文档获取方法 (cited by: 2)

Chinese-Burmese Comparable Document Acquisition Based on Topic Model and Bilingual Word Embedding
Abstract: Burmese is a low-resource language, and Chinese-Burmese comparable documents are an important data source for mining parallel sentence pairs. This paper proposes a method for acquiring Chinese-Burmese comparable documents that combines a topic model with bilingual word embeddings, recasting cross-language document similarity as a cross-language topic similarity problem. First, monolingual LDA topic models extract topics from the Chinese and Burmese documents separately, yielding a topic distribution representation for each language. Next, the extracted Chinese and Burmese topic words are represented as monolingual topic word embeddings, which are mapped into a shared semantic space using a Chinese-Burmese bilingual dictionary to obtain bilingual topic word embeddings. Finally, comparable documents are acquired by computing the similarity between Chinese and Burmese topics. Experimental results show that the proposed method improves F1 by 5.6% over a baseline that uses bilingual word embeddings alone.
Authors: LI Xunyu (李训宇); MAO Cunli (毛存礼); YU Zhengtao (余正涛); GAO Shengxiang (高盛祥); WANG Zhenhan (王振晗); ZHANG Yafei (张亚飞). Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China.
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2021, Issue 1, pp. 88-95 (8 pages)
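Funding: National Natural Science Foundation of China (61732005, 61662041, 61761026, 61866019, 61972186); National Key Research and Development Program of China (2019QY1802, 2019QY1801); Key Project of the Yunnan Applied Basic Research Program (2019FA023); Yunnan Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (2019HB006).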
Keywords: topic model; bilingual word embedding; document similarity; Chinese-Burmese; bilingual comparable document
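
The pipeline outlined in the abstract (monolingual topics, dictionary-based mapping into a shared space, topic similarity) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the LDA step has already produced topic-word probabilities and per-document topic distributions (hard-coded here), that the dictionary-based mapping is a least-squares linear map, that a topic is represented by the probability-weighted average of its word vectors, and that document similarity is a topic-distribution-weighted sum of topic cosine similarities. None of these details are stated in the record. The embeddings are toy 2-d vectors, and the Burmese tokens (`my_economy`, `my_trade`, `my_football`, `my_match`) are hypothetical placeholders rather than real Burmese words.

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares linear map W such that src_vecs @ W approximates tgt_vecs."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def topic_vector(topic_words, embeddings):
    """Probability-weighted average of the embeddings of a topic's words."""
    words = [w for w in topic_words if w in embeddings]
    weights = np.array([topic_words[w] for w in words])
    vecs = np.stack([embeddings[w] for w in words])
    return weights @ vecs / weights.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_similarity_matrix(zh_topics, my_topics, zh_emb, my_emb_shared):
    """S[i, j] = cosine similarity between Chinese topic i and Burmese topic j."""
    return np.array([
        [cosine(topic_vector(zt, zh_emb), topic_vector(mt, my_emb_shared))
         for mt in my_topics]
        for zt in zh_topics
    ])

# Toy monolingual embeddings (assumption: real ones would be trained per language).
zh_emb = {
    "经济": np.array([1.0, 0.1]), "贸易": np.array([0.9, 0.2]),
    "足球": np.array([0.1, 1.0]), "比赛": np.array([0.2, 0.9]),
}
# The toy Burmese space is the Chinese space rotated by 90 degrees.
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
my_emb = {
    "my_economy": zh_emb["经济"] @ rotation, "my_trade": zh_emb["贸易"] @ rotation,
    "my_football": zh_emb["足球"] @ rotation, "my_match": zh_emb["比赛"] @ rotation,
}

# Bilingual dictionary entries (Burmese placeholder -> Chinese word) used to fit the map.
dictionary = [("my_economy", "经济"), ("my_football", "足球")]
X = np.stack([my_emb[m] for m, _ in dictionary])
Y = np.stack([zh_emb[z] for _, z in dictionary])
W = learn_mapping(X, Y)

# Map every Burmese vector into the Chinese space, which serves as the shared space.
my_emb_shared = {w: v @ W for w, v in my_emb.items()}

# Topic-word probabilities, standing in for the output of monolingual LDA.
zh_topics = [{"经济": 0.6, "贸易": 0.4}, {"足球": 0.5, "比赛": 0.5}]
my_topics = [{"my_football": 0.7, "my_match": 0.3}, {"my_economy": 0.5, "my_trade": 0.5}]
S = topic_similarity_matrix(zh_topics, my_topics, zh_emb, my_emb_shared)

# Per-document topic distributions, also standing in for LDA output.
theta_zh = np.array([0.9, 0.1])    # Chinese document, mostly economy
theta_my_a = np.array([0.1, 0.9])  # Burmese document, mostly economy
theta_my_b = np.array([0.9, 0.1])  # Burmese document, mostly sports

sim_a = float(theta_zh @ S @ theta_my_a)
sim_b = float(theta_zh @ S @ theta_my_b)
print(f"similarity to economy document: {sim_a:.3f}")
print(f"similarity to sports document:  {sim_b:.3f}")
```

In this toy setup the economy-heavy Burmese document scores higher than the sports-heavy one against the economy-heavy Chinese document. A real system would threshold or rank such scores to select comparable document pairs, and the threshold would need tuning on held-out data.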