Abstract
To address the scarcity of lexically annotated Khmer corpora and the lack of obvious surface markers for Khmer named entities, this paper proposes a Khmer named entity recognition method that incorporates English-Khmer cross-lingual features. First, using a mature English named entity recognition model and the word alignments of an English-Khmer parallel corpus, the entity categories of source-language words are projected onto the target language. Next, a nearest-neighbor graph is built from Khmer word embeddings, and a label propagation algorithm is run over it to obtain an entity category distribution for each Khmer word, which completes the cross-lingual knowledge transfer. Finally, these category distributions are incorporated into a conditional random field (CRF) model as constraint features. Experimental results show that the CRF model with cross-lingual features effectively improves Khmer named entity recognition.
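The first two steps of the abstract (projecting tags through word alignments, then propagating them over a nearest-neighbor graph of word embeddings) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: cosine similarity as the edge weight, the value of k, the clamped-seed averaging update, the two-label tag set, the toy 2-D vectors, and the placeholder token names such as `km_new_1`.

```python
import math

def project_tags(en_tags, alignment, km_len):
    """Copy English entity tags onto the aligned Khmer token positions."""
    km_tags = ["O"] * km_len
    for en_i, km_j in alignment:
        if en_tags[en_i] != "O":
            km_tags[km_j] = en_tags[en_i]
    return km_tags

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(vectors, k):
    """Map each word to its k most similar words, weighted by cosine similarity."""
    graph = {}
    for word, vec in vectors.items():
        sims = [(other, cosine(vec, ovec))
                for other, ovec in vectors.items() if other != word]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[word] = [(other, s) for other, s in sims[:k] if s > 0]
    return graph

def propagate(graph, seeds, labels, iterations=30):
    """Repeatedly replace each unlabeled word's distribution with the weighted
    average of its neighbors' distributions; seed words stay fixed."""
    uniform = {label: 1.0 / len(labels) for label in labels}
    dist = {word: dict(seeds.get(word, uniform)) for word in graph}
    for _ in range(iterations):
        updated = {}
        for word, neighbors in graph.items():
            if word in seeds or not neighbors:
                updated[word] = dist[word]
                continue
            total = sum(weight for _, weight in neighbors)
            updated[word] = {
                label: sum(weight * dist[n][label] for n, weight in neighbors) / total
                for label in labels
            }
        dist = updated
    return dist

# Toy projection: English token 0 is a person name aligned to Khmer token 1.
projected = project_tags(["B-PER", "O"], [(0, 1), (1, 0)], km_len=2)

# Hypothetical 2-D embeddings; two words carry projected (seed) labels.
vectors = {
    "km_seed_per": [1.0, 0.1],
    "km_seed_loc": [0.1, 1.0],
    "km_new_1": [0.9, 0.2],
    "km_new_2": [0.2, 0.9],
}
seeds = {
    "km_seed_per": {"PER": 1.0, "LOC": 0.0},
    "km_seed_loc": {"PER": 0.0, "LOC": 1.0},
}
distributions = propagate(knn_graph(vectors, k=2), seeds, ["PER", "LOC"])
```

In this toy setup, the unlabeled word closest to the person seed ends up with most of its probability mass on `PER`. Per-word distributions of this kind are what the abstract describes feeding into the CRF as constraint features; how the paper encodes them as features is not specified here.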
Authors
XU Guang-yi (徐广义)
YAN Xin (严馨)
YU Zheng-tao (余正涛)
ZHOU Li-hua (周丽华)
Affiliations: Yunnan Nantian Electronics Information Co., Ltd., Kunming 650041, China; School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; School of Information Science and Engineering, Yunnan University, Kunming 650500, China
Source
Journal of Yunnan University (Natural Sciences Edition) (《云南大学学报(自然科学版)》)
CAS
CSCD
Peking University Core Journals (北大核心)
2018, Issue 5, pp. 865-871 (7 pages)
Funding
National Natural Science Foundation of China (61462055, 61562049, 61363044)
Yunnan Province High-Tech Industry Development Project Plan (201606)
Keywords
English-Khmer bilingual
Khmer named entity recognition
cross-lingual projection
label propagation
word embeddings