Unsupervised word sense disambiguation for Uyghur based on the LDA topic model (cited by: 2)
Abstract Uyghur is a typical low-resource language. Because annotated corpora for word sense disambiguation (WSD) and semantic analysis tools are scarce, traditional supervised methods are difficult to apply. To address this, the paper treats document-level WSD as analogous to text topic classification and proposes an unsupervised Uyghur WSD model built on the latent Dirichlet allocation (LDA) topic model. To strengthen the topic model's ability to separate the senses of an ambiguous word, three data preprocessing steps are added: removing stop words, filtering for effective words, and boosting the term-frequency weight of synonyms. Experimental results show that the model reaches a WSD accuracy of 65.08% on 63 randomly selected test sample sets, and 61.2% on the document-level sampled-word task.
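The three preprocessing steps named in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how stop words, effective words, or synonym weights are defined, so `stop_words`, `effective_words`, `synonyms`, and `boost` are hypothetical inputs, and English placeholder tokens stand in for Uyghur text.

```python
from collections import Counter


def preprocess(tokens, stop_words, effective_words, synonyms, boost=2):
    """Return weighted term counts for one context document.

    All parameters besides `tokens` are hypothetical: the abstract names
    the three steps but does not specify their details.
    """
    counts = Counter()
    for tok in tokens:
        if tok in stop_words:            # step 1: remove stop words
            continue
        if tok not in effective_words:   # step 2: keep only "effective" words
            continue
        # step 3: give synonyms of a candidate sense extra frequency weight
        counts[tok] += boost if tok in synonyms else 1
    return counts


def to_bag(counts):
    """Expand weighted counts into a token list a topic model could consume."""
    return sorted(counts.elements())


# Placeholder example: "river" and "water" are treated as synonyms of one
# candidate sense of the ambiguous word "bank".
counts = preprocess(
    tokens=["the", "bank", "river", "water", "the", "money", "xyz"],
    stop_words={"the"},
    effective_words={"bank", "river", "water", "money"},
    synonyms={"river", "water"},
)
bag = to_bag(counts)
```

The resulting bag of words would then be passed to an LDA implementation of the reader's choice; the boosted counts are one plausible way to make sense-related synonyms weigh more heavily in the inferred topic distribution.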
Authors YUAN Yang (袁扬); LI Xiao (李晓); YANG Yating (杨雅婷) (The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China)
Source Journal of Xiamen University: Natural Science (indexed in CAS, CSCD, Peking University Core Journals), 2020, No. 2, pp. 198-205 (8 pages)
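Funding National Natural Science Foundation of China (U1703133); Xinjiang Uygur Autonomous Region "Tianshan Cedar Program" (2017XS05); Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2018D04018); High-Level Talent Introduction Project of Xinjiang Uygur Autonomous Region (Y839031201); Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017472).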
Keywords Uyghur; unsupervised word sense disambiguation; topic model; semantic similarity; synonyms