Abstract
In language identification, traditional N-gram methods depend heavily on text length and therefore cannot identify the language of short texts effectively. Existing neural network models cannot take into account both the information within a word and the combination information between words, which lowers the quality of short-text language identification. To address these problems, this paper proposes a character-level short-text language identification method based on deep learning. A convolutional neural network (CNN) extracts character combination information within each word from character vectors; a long short-term memory (LSTM) network captures feature information between words; and a fully connected network performs the identification of similar languages. Experimental results on Uyghur and Kazakh corpora and on the DSL2017 dataset show that the method effectively improves identification accuracy for short texts in similar languages.
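The pipeline described in the abstract (character vectors, then a CNN within each word, then an LSTM across words, then a fully connected layer) can be sketched roughly as follows. This is only an illustrative forward pass in plain Python, not the authors' implementation: the class name `CharCnnLstmClassifier`, every dimension, the zero padding of short words, the max-over-time pooling, and the random untrained weights are all assumptions made for this sketch, and biases and nonlinearities after the convolution are omitted for brevity.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (illustrative initialisation only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class CharCnnLstmClassifier:
    """Hypothetical, untrained sketch: char CNN per word -> LSTM over words -> dense."""

    def __init__(self, alphabet, n_langs, char_dim=8, n_filters=6,
                 width=3, hidden=5, seed=0):
        rng = random.Random(seed)
        self.width, self.hidden = width, hidden
        self.index = {ch: i for i, ch in enumerate(alphabet)}
        # One embedding row per known character, plus a final row for unknown ones.
        self.embed = make_matrix(len(alphabet) + 1, char_dim, rng)
        # Each filter spans `width` consecutive character vectors.
        self.filters = make_matrix(n_filters, width * char_dim, rng)
        # LSTM weights for the input, forget, output and candidate gates, stacked.
        self.w_lstm = make_matrix(4 * hidden, n_filters + hidden, rng)
        # Fully connected output layer: one logit per language.
        self.w_out = make_matrix(n_langs, hidden, rng)

    def word_vector(self, word):
        """Convolve over the word's character vectors, then max-pool over positions."""
        unknown = len(self.index)
        chars = [self.embed[self.index.get(ch, unknown)] for ch in word]
        pad = [0.0] * len(self.embed[0])
        while len(chars) < self.width:  # zero-pad words shorter than the filter
            chars.append(pad)
        windows = [sum(chars[i:i + self.width], [])
                   for i in range(len(chars) - self.width + 1)]
        feats = [matvec(self.filters, window) for window in windows]
        return [max(column) for column in zip(*feats)]

    def predict_proba(self, text):
        """Run the LSTM over the word vectors and apply softmax to the dense output."""
        n = self.hidden
        h, c = [0.0] * n, [0.0] * n
        for word in text.split():
            z = matvec(self.w_lstm, self.word_vector(word) + h)
            i_gate = [sigmoid(x) for x in z[:n]]
            f_gate = [sigmoid(x) for x in z[n:2 * n]]
            o_gate = [sigmoid(x) for x in z[2 * n:3 * n]]
            cand = [math.tanh(x) for x in z[3 * n:]]
            c = [f * c_prev + i * g
                 for f, c_prev, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(c_new) for o, c_new in zip(o_gate, c)]
        logits = matvec(self.w_out, h)
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = CharCnnLstmClassifier("abcdefghijklmnopqrstuvwxyz", n_langs=3)
probs = model.predict_proba("short text sample")
```

Because the weights here are random, `probs` carries no linguistic meaning; the sketch only shows how character-level information inside each word and sequence information across words could feed a single classifier, which is the combination the abstract says earlier neural models lacked.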
Authors
Zhang Linlin (张琳琳)
Yang Yating (杨雅婷)
Chen Zhanheng (陈沾衡)
Pan Yirong (潘一荣)
Li Yu (李毓)
Affiliations: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, Xinjiang, China; University of the Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, Xinjiang, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (北大核心)
2020, Issue 2, pp. 124-129, 176 (7 pages)
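Funding
National Natural Science Foundation of China (U1703133)
"West Light" Program of the Chinese Academy of Sciences (2017-XBQNXZ-A-005)
Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017472)
Major Science and Technology Project of Xinjiang Uygur Autonomous Region (2016A03007-3)
High-Level Talent Introduction Project of Xinjiang Uygur Autonomous Region (Y839031201)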
Keywords
Language identification
Similar languages
Short text
Neural network
Text categorization