Abstract
Entities in the cross-border national culture domain are usually composed of domain words that describe national cultural characteristics. Mainstream entity recognition methods based on character representations therefore suffer from blurred domain entity boundaries, which leads to recognition errors. To address this, this paper proposes a cross-border national cultural entity recognition method that incorporates word set information, using word sets obtained from a domain lexicon to enhance the word boundary and word semantic information of domain entities. First, a cross-border national cultural domain lexicon is constructed to obtain the word set information. Second, the weights among the word set vectors are obtained through a word set attention mechanism, and positional encoding is incorporated to enhance word set position information. Finally, the word set information is incorporated into the feature extraction layer to enhance domain entity boundary information and to alleviate the loss of word semantics caused by using only character feature representations. Experimental results show that, on a cross-border national cultural text dataset, the proposed method improves the F1 value by 2.71% over the baseline method.
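The first two steps of the method (lexicon matching and attention over word set vectors) can be illustrated with a minimal sketch. The abstract does not say how word sets are built or how attention scores are computed, so everything below is an assumption for illustration: the B/M/E/S grouping of matched words, the scaled dot-product scoring, the toy lexicon, and the function names `match_word_sets` and `attention_pool` are all hypothetical and are not taken from the paper. Positional encoding and the feature extraction layer are omitted.

```python
import math


def match_word_sets(sentence, lexicon, max_len=4):
    """For each character, collect the lexicon words that cover it.

    Words are grouped by where the character sits inside the word:
    B (begin), M (middle), E (end), S (single-character word).
    This grouping is an assumption, not the paper's stated design.
    """
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence]
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            if word not in lexicon:
                continue
            if j - i == 1:
                sets[i]["S"].add(word)
                continue
            sets[i]["B"].add(word)
            sets[j - 1]["E"].add(word)
            for k in range(i + 1, j - 1):
                sets[k]["M"].add(word)
    return sets


def attention_pool(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns (weights, pooled), where pooled is the weighted sum.
    An empty list of vectors yields no weights and a zero vector.
    """
    if not vectors:
        return [], [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale for vec in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(query))
    ]
    return weights, pooled


# Toy domain lexicon and sentence (hypothetical data).
lexicon = {"傣族", "泼水", "泼水节"}
word_sets = match_word_sets("傣族泼水节", lexicon)

# Toy 2-dimensional vectors standing in for a character and two matched words.
weights, pooled = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy run the character 泼 begins both 泼水 and 泼水节, so both words land in its B set, which is the kind of boundary cue a purely character-based representation lacks. The attention weights then decide how much each matched word contributes to the pooled vector for that character.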
Authors
杨振平
毛存礼
雷雄丽
高盛祥
陆杉
张勇丙
YANG Zhenping; MAO Cunli; LEI Xiongli; GAO Shengxiang; LU Shan; ZHANG Yongbing (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China; Kunming Metallurgical College, Kunming, Yunnan 650500, China)
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journals (北大核心)
2022, No. 10, pp. 88-96 (9 pages)
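Funding
National Natural Science Foundation of China (61732005, 61866019, 61761026, 61972186)
Key Project of the Yunnan Applied Basic Research Program (2019FA023)
Yunnan Characteristic Industry Digitalization Research and Application Demonstration (202002AD080001)
Yunnan Reserve Talents Program for Young and Middle-aged Academic and Technical Leaders (2019HB006)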
Keywords
cross-border national culture
entity recognition
word set information
domain lexicon
attention mechanism