Abstract
This paper studies named entity recognition (NER) for ancient Chinese, using a dataset drawn from the Complete Collection of Four Treasuries (《四库全书》, Siku Quanshu). It proposes an NER algorithm for ancient Chinese based on the Lattice LSTM model, which takes both character-sequence and word-sequence information as model input. Text is segmented with the Jiayan (甲言) word segmentation tool, and word2vec is used to train character-level and word-level embeddings for ancient Chinese, which serve as input to the Lattice LSTM model. With the Lattice LSTM model and these pre-trained embeddings, recognition performance on ancient Chinese improves: compared with the traditional BiLSTM-CRF model, the F1 score rises by about 3.95%.
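To illustrate what combining character-sequence and word-sequence information can look like, the following is a minimal sketch of how a word lattice might be built over a character sequence by lexicon matching. It is not the authors' implementation: the function name `build_lattice`, the `max_len` parameter, the example sentence, and the three-word lexicon are all hypothetical. In the setting the abstract describes, the lexicon would presumably come from the vocabulary of the word embeddings trained on Jiayan-segmented text.

```python
from typing import List, Set, Tuple


def build_lattice(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) spans for every lexicon word found in chars.

    `end` is exclusive. Single characters are skipped because the
    character-level path of the model already covers them.
    """
    spans = []
    for start in range(len(chars)):
        # Try every substring of length 2..max_len beginning at `start`.
        for end in range(start + 2, min(len(chars), start + max_len) + 1):
            word = chars[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Hypothetical example sentence and lexicon (assumptions, not from the paper).
sentence = "司马迁作史记"
lexicon = {"司马", "司马迁", "史记"}
lattice = build_lattice(sentence, lexicon)
for start, end, word in lattice:
    print(start, end, word)
```

Each span marks a word that starts and ends on specific characters. A lattice-style model can use such spans as extra paths alongside the plain character sequence, so overlapping candidates such as 司马 and 司马迁 are both kept rather than committing to a single segmentation.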
Authors
崔丹丹
刘秀磊
陈若愚
刘旭红
李臻
齐林
CUI Dan-dan; LIU Xiu-lei; CHEN Ruo-yu; LIU Xu-hong; LI Zhen; QI Lin (Computer School, Beijing Information Science and Technology University, Beijing 100192, China)
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2020, Issue S02, pp. 18-22 (5 pages)
Computer Science
Funding
National Key R&D Program of China (2017YFB1400402).