Abstract
Chinese word segmentation (CWS) is a fundamental task in natural language processing (NLP). Because Chinese vocabulary contains many polysemous words and homophonous characters, segmenting text accurately remains one of the challenges facing CWS research in recent years. This paper therefore proposes a shared BiLSTM-CRF model that fuses character features, Pinyin features, and Wubi input features by sharing the LSTM network during training. Experiments on a large number of datasets show that feature fusion significantly improves tagging accuracy. Without using any external lexical resources, the model reaches accuracies of 96.9% and 97.3% on the AS and CityU datasets, respectively.
Authors
张倩
高建瓴
丁容
ZHANG Qian; GAO Jianling; DING Rong (College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China)
Source
《智能计算机与应用》
2022, Issue 10, pp. 57-61, 67 (6 pages in total)
Intelligent Computer and Applications