
面向音素序列的黏着语词干提取研究

Phoneme Sequence Based Stemming of Agglutinative Language
Abstract Existing stemming methods for agglutinative languages handle sentence-level corpora with contextual information poorly. Taking Uyghur as the object of study, this paper proposes a stemming model that fuses sentence context with character features and consists of a BiLSTM, an attention mechanism (Attention), and a CRF. Sentence-level character feature vectors serve as the input. The BiLSTM extracts forward and backward contextual sequence features, and an attention mechanism added on top of it learns weights and extracts global feature information to capture the boundaries between stems and affixes. Finally, a CRF layer learns further information from the sequence features so that the context is described more effectively. To verify the model, it is evaluated on two different datasets and compared with traditional models. The experimental results show that the model performs better on sentence-level corpora and extracts stems more effectively. It also outperforms the traditional models it is compared with and takes the data features into account more fully.
Authors Gvzelnur Imin (古再力努尔·依明); Mijit Ablimit (米吉提·阿不里米提); Hankiz Yilahun (哈妮克孜·伊拉洪); Askar Hamdulla (艾斯卡尔·艾木都拉) (College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
Source Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2023, No. 10, pp. 2362-2368 (7 pages)
Funding Supported by the National Key Research and Development Program of China (2017YFC0820603).
Keywords agglutinative language; Uyghur language; stemming; context; attention mechanism; BiLSTM-Attention-CRF
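
The abstract describes a CRF layer that sits on top of the BiLSTM and attention features and helps locate stem and affix boundaries. The sketch below illustrates only that final decoding step, under several assumptions that are not taken from the paper: a two-tag scheme (`S` for a stem character, `A` for an affix character), hand-written emission scores standing in for the network's output, a hand-written transition table, and the Latin-script toy word "kitablar". The names `viterbi`, `split_stem`, `EMISSIONS`, and `TRANSITIONS` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical tag set (an assumption, not the paper's labeling scheme):
# "S" marks a stem character, "A" marks an affix character.
TAGS = ["S", "A"]

# Toy emission scores for the characters of "kitablar". In a real model
# these would come from the BiLSTM + attention layers; here they are invented.
EMISSIONS: List[Dict[str, float]] = [
    {"S": 2.0, "A": 0.0},  # k
    {"S": 2.0, "A": 0.0},  # i
    {"S": 0.4, "A": 0.6},  # t  (locally ambiguous, slightly prefers "A")
    {"S": 2.0, "A": 0.0},  # a
    {"S": 2.0, "A": 0.0},  # b
    {"S": 0.0, "A": 2.0},  # l
    {"S": 0.0, "A": 2.0},  # a
    {"S": 0.0, "A": 2.0},  # r
]

# Toy transition scores: returning to the stem after an affix is penalized.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    ("S", "S"): 0.0, ("S", "A"): 0.0,
    ("A", "A"): 0.0, ("A", "S"): -5.0,
}


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag path of a linear-chain CRF."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the start.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def split_stem(word: str, tags: List[str]) -> Tuple[str, str]:
    """Split a word into (stem, affixes) at the first affix tag."""
    cut = tags.index("A") if "A" in tags else len(word)
    return word[:cut], word[cut:]


if __name__ == "__main__":
    tags = viterbi(EMISSIONS, TRANSITIONS)
    print(tags, split_stem("kitablar", tags))
```

With these toy scores, picking the best tag for each character independently would label "t" as an affix character and cut the word after "ki". Decoding the whole sequence avoids that, because the transition penalty makes an affix tag followed by a stem tag unlikely. This is one way to read the abstract's claim that the CRF uses sequence-level information.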
