Abstract
Uyghur is a typical low-resource language. Because sense-annotated corpora and semantic analysis tools are scarce, traditional supervised methods for word sense disambiguation (WSD) are difficult to apply. To address this, we treat document-level WSD as analogous to text topic classification and propose an unsupervised Uyghur WSD model based on the latent Dirichlet allocation (LDA) topic model. To strengthen the topic model's ability to separate the senses of an ambiguous word, we add three data preprocessing steps: removing stop words, filtering for effective words, and boosting the term-frequency weight of synonyms. Experimental results show that the model reaches a WSD accuracy of 65.08% on 63 randomly selected groups of test samples and 61.2% on the document-level sampled-word task.
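The three preprocessing steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual procedures, so everything here is an assumption made for illustration: the function name `preprocess`, the use of a whitelist for "effective words", the integer `boost` factor, and the English placeholder tokens. It only shows how a bag-of-words count might be shaped before being passed to a topic model.

```python
from collections import Counter


def preprocess(tokens, stop_words, valid_words, synonyms, boost=2):
    """Build a bag-of-words Counter after three illustrative steps.

    All parameters are hypothetical stand-ins, not the paper's resources:
    - stop_words: words to drop outright
    - valid_words: whitelist standing in for "effective words"
    - synonyms: words treated as synonyms of a candidate sense
    - boost: integer multiplier applied to synonym counts
    """
    # Step 1: remove stop words.
    kept = [t for t in tokens if t not in stop_words]
    # Step 2: keep only words on the (assumed) effective-word whitelist.
    kept = [t for t in kept if t in valid_words]
    counts = Counter(kept)
    # Step 3: boost the frequency weight of synonyms.
    for word in list(counts):
        if word in synonyms:
            counts[word] *= boost
    return counts


# Placeholder English tokens; a real run would use segmented Uyghur text.
tokens = ["the", "bank", "river", "water", "the", "money", "xyz"]
result = preprocess(
    tokens,
    stop_words={"the"},
    valid_words={"bank", "river", "water", "money"},
    synonyms={"river", "water"},
)
print(result)
```

The resulting counts could then serve as the document representation for an LDA implementation of the reader's choice; the topic-modeling step itself is not sketched here.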
Authors
YUAN Yang (袁扬); LI Xiao (李晓); YANG Yating (杨雅婷)
The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; University of Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
Source
Journal of Xiamen University: Natural Science (《厦门大学学报(自然科学版)》)
CAS
CSCD
Peking University Core Journals (北大核心)
2020, Issue 2, pp. 198-205 (8 pages)
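Funding
National Natural Science Foundation of China (U1703133)
Xinjiang Uygur Autonomous Region "Tianshan Cedar Program" (2017XS05)
Open Project of the Key Laboratory of Xinjiang Uygur Autonomous Region (2018D04018)
High-Level Talent Introduction Project of Xinjiang Uygur Autonomous Region (Y839031201)
Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017472)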
Keywords
Uyghur
unsupervised word sense disambiguation
topic model
semantic similarity
synonyms