Abstract
A persistent problem in Chinese word segmentation is the poor recognition of out-of-vocabulary (OOV) words in specialized domains. In architectural texts, the characteristics of the domain's own terminology make OOV words especially hard to identify during segmentation. This paper proposes an unsupervised OOV recognition method that combines an improved algorithm with adjacency entropy. The algorithm first identifies character strings whose components are strongly interdependent in the text. These strings are then filtered using a stop-word list and a corpus to obtain a candidate dictionary. The adjacency entropy of each candidate is computed, and a threshold determines the final OOV words, which are added to a domain dictionary used for segmentation. Experiments on construction-domain text show that the proposed method recognizes OOV words well: compared with the algorithm it improves on, precision (P) increases by 15.92% and recall (R) by 7.61%. The final segmentation reaches a precision of 82.15% and a recall of 80.45%.
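The pipeline outlined in the abstract (score string cohesion, filter with stop words and a lexicon, then threshold on adjacency entropy) can be illustrated with a minimal sketch. The abstract does not name the cohesion measure. Because the keywords list pointwise mutual information, the sketch assumes plain PMI, not the paper's improved variant. The function names, the restriction to two-character candidates, and the thresholds are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbouring characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def adjacency_entropy(text, candidate):
    """Smaller of the left and right neighbour entropies of a candidate."""
    left, right = Counter(), Counter()
    k = len(candidate)
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        if start + k < len(text):
            right[text[start + k]] += 1
        start = text.find(candidate, start + 1)
    return min(entropy(left), entropy(right))


def find_new_words(text, stop_chars, lexicon, pmi_min, entropy_min):
    """Return two-character strings that pass both thresholds.

    Assumption: plain PMI over character bigrams stands in for the
    paper's (unnamed) improved algorithm.
    """
    n = len(text)
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(n - 1))
    found = []
    for pair, count in pairs.items():
        # Filtering step: drop known words and strings with stop characters.
        if pair in lexicon or any(ch in stop_chars for ch in pair):
            continue
        p_xy = count / (n - 1)
        p_x = chars[pair[0]] / n
        p_y = chars[pair[1]] / n
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= pmi_min and adjacency_entropy(text, pair) >= entropy_min:
            found.append(pair)
    return found


# Toy input: "XY" recurs with varied neighbours on both sides.
print(find_new_words("aXYbcXYdeXYf", set(), set(), pmi_min=1.0, entropy_min=1.0))
```

In this toy input, one-off strings such as "bc" score a high PMI simply because they are rare, but their neighbours never vary, so the adjacency-entropy threshold rejects them. That is the reason for combining the two measures in a sketch like this one.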
Authors
李鹏
光永星
乔天玲
操峻岩
LI Peng; GUANG Yong-xing; QIAO Tian-ling; CAO Jun-yan (School of Science, Shenyang Jianzhu University, Shenyang 11000, China)
Source
《电脑与信息技术》
2021, No. 5, pp. 67-72 (6 pages)
Computer and Information Technology
Keywords
new word recognition
pointwise mutual information
Chinese word segmentation