Abstract
Starting from the grammar and compositional characteristics of Korean, this paper studies effective representations of Korean entities at three granularities, namely jamo (phoneme), syllable, and morpheme, and proposes a multi-granularity fusion method for Korean named entity recognition (NER). The method fuses multi-granularity features by considering both the connections and the differences among granularities. First, the jamo-level features of Korean are encoded, and a CNN-based model fuses the jamo and syllable granularities to produce syllable vectors. Second, the fastText pre-trained model encodes the resulting syllable vectors to capture their sequential features. In parallel, the KLUE-BERT pre-trained model is used to model morpheme-level features and obtain morpheme vectors. Finally, the syllable vectors and morpheme vectors are fused into a text representation containing multi-granularity features, and TENER, a Transformer-based NER model, performs Korean named entity recognition. Experiments on the Klpexpo 2016 and KLUE-NER corpora show that the proposed representations and fusion method capture Korean entity features effectively, achieving an F1 score of 89.45% on Klpexpo 2016 and 88.82% on KLUE-NER.
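To make the jamo and syllable granularities concrete, the sketch below splits precomposed Hangul syllables into their conjoining jamo using the standard Unicode arithmetic for the Hangul Syllables block. It illustrates only the input side of the jamo-level encoding; it is not the authors' code, and the function names `decompose_syllable` and `text_to_jamo` are hypothetical.

```python
# Unicode Hangul Syllables block: U+AC00 .. U+D7A3.
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
NUM_MEDIALS = 21   # medial vowels per initial consonant
NUM_FINALS = 28    # 27 final consonants plus "no final"


def decompose_syllable(ch):
    """Return the conjoining jamo of one character as a tuple.

    Non-Hangul characters are returned unchanged as a 1-tuple.
    """
    code = ord(ch)
    if not (HANGUL_BASE <= code <= HANGUL_LAST):
        return (ch,)
    offset = code - HANGUL_BASE
    initial, rest = divmod(offset, NUM_MEDIALS * NUM_FINALS)
    medial, final = divmod(rest, NUM_FINALS)
    jamo = [chr(0x1100 + initial), chr(0x1161 + medial)]
    if final:  # final index 0 means the syllable has no final consonant
        jamo.append(chr(0x11A7 + final))
    return tuple(jamo)


def text_to_jamo(text):
    """Map each character of text to its jamo tuple (syllable granularity kept)."""
    return [decompose_syllable(ch) for ch in text]
```

Keeping one tuple per syllable preserves the syllable boundaries, which is the shape a model would need in order to pool jamo-level features into one vector per syllable.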
Authors
黄政豪
金光洙
高君龙
HUANG Zhenghao; JIN Guangzhu; GAO Junlong (College of Engineering, Yanbian University, Yanji, Jilin 133002, China; College of Korean and Han Language and Literature, Yanbian University, Yanji, Jilin 133002, China)
Source
《中文信息学报》
CSCD
PKU Core (Peking University Core Journals)
2023, No. 8, pp. 66-74 (9 pages)
Journal of Chinese Information Processing
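Funding
National Philosophy and Social Science Fund of China (18ZDA306)
Yanbian University World-Class Discipline Construction Key Research Project in Foreign Languages and Literature (18YLGG01).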