Abstract
Chinese named entity recognition (NER) models commonly take character embeddings as the input to a neural network. Because Chinese text has no explicit word boundaries, character-only input loses part of the semantic information. To address this problem, this paper proposes a Chinese NER model based on multi-granularity text representations. The model first combines character and word representations at the input layer, and then uses an N-gram encoder to mine the potential word-formation information contained in N-grams. This effectively combines text representations at three different granularities and enriches the contextual representation of the sequence. In experiments on the Weibo, Resume, and OntoNotes4 datasets, the model achieves F1 scores of 72.41%, 96.52%, and 82.83%, respectively, outperforming the baseline models.
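To make the three granularities concrete, the following is a minimal sketch, not the authors' model. It only enumerates the raw units (characters, character N-grams, and lexicon-matched words) that such a model could embed. The function names, the `max_n` parameter, and the toy lexicon are all hypothetical and introduced purely for illustration; the paper's N-gram encoder is a neural component and is not reproduced here.

```python
from typing import Dict, List, Set


def char_ngrams(sentence: str, max_n: int = 3) -> Dict[int, List[str]]:
    """Return all contiguous character n-grams for n = 1..max_n."""
    return {
        n: [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
        for n in range(1, max_n + 1)
    }


def lexicon_matches(sentence: str, lexicon: Set[str], max_n: int = 3) -> List[str]:
    """Keep the n-grams (n >= 2) that also appear in a word lexicon."""
    grams = char_ngrams(sentence, max_n)
    return [g for n in range(2, max_n + 1) for g in grams[n] if g in lexicon]


# Hypothetical example input and toy lexicon (assumptions, not from the paper).
sentence = "沈阳航空航天大学"
toy_lexicon = {"沈阳", "航空", "航天", "大学"}

grams = char_ngrams(sentence)                   # character and N-gram granularity
words = lexicon_matches(sentence, toy_lexicon)  # word granularity
```

Here `grams[1]` holds the single characters, `grams[2]` and `grams[3]` hold candidate N-grams that may or may not be real words, and `words` keeps only the candidates confirmed by the lexicon. A real system would map each unit to an embedding and let the encoder weigh them.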
Authors
田雨
张桂平
蔡东风
陈华威
宋彦
TIAN Yu; ZHANG Guiping; CAI Dongfeng; CHEN Huawei; SONG Yan (Human-Computer Intelligence Research Center, Shenyang Aerospace University, Shenyang, Liaoning 110136, China; School of Data Science, The Chinese University of Hong Kong (Shenzhen), Shenzhen, Guangdong 518172, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals
2022, Issue 4, pp. 90-99 (10 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (U1908216)
Key Research and Development Program of Liaoning Province (2019JH2/10100020)