Abstract
New word recognition is a difficult problem in food safety information processing, and unrecognized new words are a major cause of word segmentation errors. To improve segmentation accuracy on food safety text, this work uses mutual information to extract features of candidate new words and a BP neural network to filter out junk strings. First, several statistical features of the candidate new words are obtained on the basis of mutual-information-based new word recognition. Next, each candidate string is manually labeled according to whether it forms a word. Finally, the statistical features and the manually labeled new words are used as training samples to build a BP neural network model for new word recognition. In experiments on food safety information text, the method achieves a new word recognition accuracy of 0.806. The results show that a BP neural network model built on mutual information feature extraction recognizes new words well and reduces word misjudgment, and that it has practical value for new word recognition and domain dictionary construction for food safety information text.
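The first stage described in the abstract, scoring candidate strings by mutual information, can be illustrated with a minimal sketch. The abstract does not give the exact formula or the other statistical features, so the pointwise mutual information (PMI) definition, the restriction to two-character candidates, the `min_count` threshold, the function name `pmi_candidates`, and the toy text are all assumptions made for illustration. The BP neural network filtering stage is not shown.

```python
import math
from collections import Counter


def pmi_candidates(text, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Hypothetical helper: the paper's actual features and thresholds are
    not stated in the abstract.
    """
    char_counts = Counter(text)
    pair_counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # skip rare pairs; the threshold is an assumption
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy text (invented for illustration, not from the paper's data set).
toy_text = "苏丹红超标苏丹红检出红色素超标"
scores = pmi_candidates(toy_text)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In a pipeline like the one the abstract outlines, scores of this kind would be collected with other per-candidate statistics into a feature vector, paired with the manual word/non-word labels, and passed to the classifier.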
Authors
Ma Qiang; Lu Yang; Li Fei (College of Electrical and Information, Heilongjiang Bayi Agricultural University, Daqing 163319)
Source
Journal of Heilongjiang Bayi Agricultural University
2021, Issue 2, pp. 73-79 (7 pages)
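Funding
China Postdoctoral Science Foundation, General Program (2016M591560)
Heilongjiang Provincial Government Postdoctoral Funding (LBH-Z15185)
Heilongjiang Postdoctoral Scientific Research Startup Fund (LBH-Q17134)
Natural Science Foundation of Heilongjiang Province, Key Project (ZD2019F001)
Natural Science Foundation of Heilongjiang Province, Joint Guidance Project (LH2020F042)
Heilongjiang Bayi Agricultural University, Key Cultivation Project (XA2016-05)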
Keywords
mutual information model
food safety information
new word discovery
BP neural network