Normalization of Uyghur Terms Based on Subword Information (基于子词信息的维吾尔语词项规范化)


Abstract: Latinized Uyghur text is often nonstandard in practice. This irregularity is the main cause of ambiguity and related problems, and it seriously constrains natural language processing applications for Uyghur. This paper proposes an unsupervised text normalization method based on subword information, which takes the internal information of words into account when constructing word vectors. The method can produce vector representations for rare words and can incorporate word-internal morphological information into the word representation, enriching the word vectors. This in turn is used to improve the quality of the candidate sets of normalized words generated in unsupervised learning. Experiments show that, compared with traditional word vector construction methods, the proposed method improves the recall of normalized words in the text normalization task.
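The idea of representing a word through its internal character n-grams can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain n-gram count vectors as a stand-in for learned subword embeddings, and the function names (`char_ngrams`, `subword_vector`, `candidates`) and the toy word forms are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Return character n-grams of `word`, with < and > marking word boundaries."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def subword_vector(word):
    """Sparse n-gram count vector (assumption: stands in for learned n-gram embeddings)."""
    return Counter(char_ngrams(word))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidates(token, vocabulary, k=3):
    """Rank standard-form words by subword similarity to a nonstandard token."""
    query = subword_vector(token)
    scored = [(cosine(query, subword_vector(w)), w) for w in vocabulary]
    # Keep only words that share at least one n-gram with the token.
    return [w for score, w in sorted(scored, reverse=True)[:k] if score > 0]


# Toy vocabulary of standard forms (hypothetical example data).
vocabulary = ["kitab", "mektep", "oqughuchi"]
print(candidates("kitap", vocabulary))
```

Because the vector is built from n-grams rather than whole-word identity, a token never seen before still receives a representation, which is the property the abstract relies on for rare and nonstandard words.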

Authors: ZHANG Xinlu (张新路); WANG Lei (王磊); YANG Yating (杨雅婷); MI Chenggang (米成刚) (Xinjiang Laboratory of Minority Speech and Language Information Processing, the Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; School of Computer Science and Technology, University of the Chinese Academy of Sciences, Beijing 100049, China)

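Affiliations: Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences; School of Computer Science and Technology, University of the Chinese Academy of Sciences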

Source: Journal of Xiamen University: Natural Science (《厦门大学学报（自然科学版）》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2019, No. 2, pp. 217-224 (8 pages)

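Funding: National Natural Science Foundation of China (U1703133); Xinjiang Autonomous Region Major Science and Technology Project (2016A03007-3); Chinese Academy of Sciences "West Light" Talent Training and Recruitment Program (2017-XBQNXZ-A-005)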

Keywords: Uyghur; natural language processing; text normalization; word embedding

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]