Abstract
As exchanges and cooperation between China and Cambodia become increasingly frequent, natural language processing for Cambodian is growing in importance. Cambodian corpus resources are limited, and corpora tagged with Cambodian organization names are especially scarce. To address this, a method for recognizing Cambodian organization names based on semi-supervised Tri-training is proposed. The method uses an improved Tri-training algorithm combined with the linguistic characteristics of Cambodian. In the experiments, accuracy and recall reached 65.68% and 67.83%, respectively, indicating that the method can effectively exploit a large amount of untagged data to obtain a tagged corpus of relatively high accuracy.
Authors
谢俊
严馨
王若兰
周枫
李思远
XIE Jun; YAN Xin; WANG Ruo-lan; ZHOU Feng; GUO Jian-yi; LI Si-yuan (Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, China)
Source
《软件导刊》
2018, Issue 5, pp. 127-131 (5 pages)
Software Guide
Keywords
semi-supervised learning
Tri-training
tagged corpus
feature selection