Abstract
In traditional new word discovery, data sparsity leaves some low-frequency new words unrecognized. To address this, a method is proposed that computes information measures over word segmentation results and applies the deep learning model BiLstm-CRF to new word discovery. The computed information measures represent the internal cohesion and the separation of a word, and hand-crafted rules are added for filtering. The BiLstm-CRF model has high precision, depends little on word vectors, and takes contextual information into account. Combining the information measures with the model addresses the need for large amounts of manual sequence labeling and raises the recognition rate of low-frequency new words. Experimental results show that the proposed method effectively improves the accuracy of new word recognition.
Authors
黄文明
杨柳青青
任冲
HUANG Wen-ming; YANG Liu-qing-qing; REN Chong (School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China)
Source
《计算机工程与设计》
Peking University Core Journal (北大核心)
2019, No. 7, pp. 1903-1907, 1914 (6 pages)
Computer Engineering and Design
Funding
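Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems Fund (yf17106)
Guilin Scientific Research and Technology Development Program (2016010406-1)
Guangxi Science and Technology Key Research Program (桂科攻1598019-6)
Guilin University of Electronic Technology Graduate Innovation Fund (2016YJCX64)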