摘要
词汇的表示问题是自然语言处理的基础研究内容。目前单语词汇分布表示已经在一些自然语言处理问题上取得很好的应用效果,然而在跨语言词汇的分布表示上国内外研究很少,针对这个问题,利用两种语言名词、动词分布的相似性,通过弱监督学习扩展等方式在中文语料中嵌入泰语的互译词、同类词、上义词等,学习出泰语词在汉泰跨语言环境下的分布。实验基于学习到的跨语言词汇分布表示应用于双语文本相似度计算和汉泰混合语料集文本分类,均取得较好效果。
Word representation is the basic research content of natural language processing. At present, distributed representation of monolingual words has shown satisfactory application effect in some Neural Probabilistic Language (NPL) research, while as for distributed representation of cross-lingual words, there is little research both at home and abroad. Aiming at this problem, given distribution simi larity of nouns and verbs in these two languages, we embed mutual translated words, synonyms, superordinates into Chinese corpus by the weakly supervised learning extension approach and other methods, thus Thai word distribution in cross-lingual environment of Chinese and Thai is learned. We applied the distributed representation of the cross-lingual words learned before to compute similarities of bilingual texts and classify the mixed text corpus of Chinese and Thai. Experimental results show that the proposal has a satisfactory effect on the two tasks.
出处
《计算机工程与科学》
CSCD
北大核心
2015年第12期2358-2365,共8页
Computer Engineering & Science
基金
国家自然科学基金资助项目(61363044)
关键词
弱监督学习扩展
跨语言语料
跨语言词汇分布表示
神经概率语言模型
weakly supervised learning extension
cross-lingual corpus
cross-lingual word distribution representations
neural probabilistic language model