Abstract
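Managing, studying, and using the increasingly rich language resources of ethnic minorities has significant practical value. To bridge the gap caused by language differences, a bilingual topical word embedding model is proposed for cross-lingual text classification in the Chinese-Korean setting. The paper extends the word embedding model and the topic model to the bilingual setting and combines the two, addressing the effect of ambiguity on cross-lingual text classification accuracy. First, word embedding vectors for Chinese and Korean words are trained on large-scale word-aligned parallel sentence pairs. Second, a topic model is used to represent the Chinese-Korean classification corpus, yielding word embedding vectors for Chinese and Korean words that carry topic information. Finally, these topical word embedding vectors are fed into a text classifier for model training and classification prediction. Experimental results show that accuracy on the Chinese-Korean cross-lingual text classification task reaches 91.76%, a level suitable for practical application, and that the proposed model represents the multiple senses of polysemous words well.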
A bilingual topical word embedding model is proposed for the Chinese-Korean cross-lingual text classification task. The model combines a topic model with bilingual word embeddings to reduce the effect of polysemy-induced ambiguity on cross-lingual text classification accuracy. First, word embedding representations of bilingual words are trained on large-scale word-aligned parallel sentence pairs. Second, the dataset for the classification task is processed and represented by a topic model, and the topic words in both languages are obtained. Finally, the word embeddings of these topic words are input into a traditional text classifier and a deep learning text classifier. The experimental results show that the accuracy reaches 91.76% in the Chinese-Korean cross-lingual text classification task.
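The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the topic assignments, the choice to concatenate a word vector with a topic vector, and the nearest-centroid classifier are all assumptions made for illustration, and every name below is hypothetical.

```python
from math import sqrt

# Assumption: toy 2-d word vectors already aligned in a shared Chinese-Korean space.
WORD_VECS = {
    "足球": [0.90, 0.10], "축구": [0.88, 0.12],  # "football"
    "股票": [0.10, 0.90], "주식": [0.12, 0.88],  # "stock"
}
# Assumption: one toy vector per topic, standing in for topic-model output.
TOPIC_VECS = {0: [1.0, 0.0], 1: [0.0, 1.0]}


def topical_vec(word, topic):
    """Concatenate a word vector with the vector of the topic it occurs under."""
    return WORD_VECS[word] + TOPIC_VECS[topic]


def doc_vec(tokens):
    """Average the topical vectors of (word, topic) pairs in a document."""
    vecs = [topical_vec(w, t) for w, t in tokens if w in WORD_VECS]
    if not vecs:
        raise ValueError("document has no known words")
    return [sum(col) / len(col) for col in zip(*vecs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def train_centroids(labeled_docs):
    """Nearest-centroid classifier: one mean document vector per label."""
    grouped = {}
    for tokens, label in labeled_docs:
        grouped.setdefault(label, []).append(doc_vec(tokens))
    return {
        label: [sum(col) / len(col) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, tokens):
    v = doc_vec(tokens)
    return max(centroids, key=lambda label: cosine(centroids[label], v))


# Train on Chinese documents only, then classify a Korean document.
centroids = train_centroids([
    ([("足球", 0)], "sports"),
    ([("股票", 1)], "finance"),
])
print(predict(centroids, [("축구", 0)]))
```

Because the same word paired with different topics yields different vectors, a polysemous word can receive one representation per sense, which is the property the abstract attributes to the model.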
Authors
王琪
田明杰
崔荣一
赵亚慧
WANG Qi; TIAN Mingjie; CUI Rongyi; ZHAO Yahui (Intelligent Information Processing Lab., Department of Computer Science and Technology, Yanbian University, Yanji, Jilin 133002, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals
2020, No. 12, pp. 39-47 (9 pages)
Journal of Chinese Information Processing
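Funding
State Language Commission "13th Five-Year Plan" Research Project (YB135-76)
Yanbian University World-Class Discipline Construction Research Project in Foreign Languages and Literatures (18YLPY13, 18YLPY14).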