Abstract
New word detection, as a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. This paper proposes a new word detection method for ancient Chinese corpora, named the AP-LSTM-CRF algorithm, which consists of three steps. First, a parallelized improved Apriori algorithm, implemented on the Apache Spark distributed computing framework, efficiently generates a candidate word set from the large-scale raw corpus. Second, a segmentation probability model combining a recurrent neural network with a conditional random field segments the sentences of the test documents, producing sequences of segmentation probabilities. Third, filtering rules that incorporate these segmentation probabilities remove noise words from the candidate set, leaving the genuine new words. Experimental results show that the method effectively discovers new words in large-scale ancient Chinese corpora: the F1 score reaches 89.68% on the Song Poetry dataset and 81.13% on the History of the Song Dynasty dataset, improvements of 8.66% and 2.21%, respectively, over existing methods.
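The three-step pipeline described above can be summarized in code. The sketch below is a minimal, in-memory illustration rather than the authors' implementation: the Spark parallelization of step one is omitted, the LSTM-CRF of step two is replaced by a placeholder `boundary_prob` function, and the filtering rule and thresholds in step three (`min_count`, `internal_cut_max`) are hypothetical choices for demonstration only.

```python
from collections import Counter


def candidate_ngrams(corpus: str, max_len: int = 4, min_count: int = 5) -> set:
    """Apriori-style candidate generation: an n-gram is kept only if it is
    frequent and both of its (n-1)-character substrings were frequent."""
    frequent = {1: {ch for ch, k in Counter(corpus).items() if k >= min_count}}
    candidates = set()
    for n in range(2, max_len + 1):
        counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
        frequent[n] = {g for g, k in counts.items()
                       if k >= min_count
                       and g[:-1] in frequent[n - 1]
                       and g[1:] in frequent[n - 1]}
        candidates |= frequent[n]
    return candidates


def boundary_prob(sentence: str) -> list:
    """Placeholder for the paper's LSTM-CRF segmenter: probability of a word
    boundary after each character (a real model would be trained separately)."""
    return [0.5] * max(len(sentence) - 1, 0)


def filter_candidates(candidates: set, sentences: list,
                      internal_cut_max: float = 0.5) -> set:
    """One plausible probability-based filter: keep a candidate if, in at
    least one sentence, every boundary inside it is an unlikely cut point."""
    new_words = set()
    for sent in sentences:
        probs = boundary_prob(sent)
        for w in candidates:
            pos = sent.find(w)
            while pos != -1:
                inside = probs[pos:pos + len(w) - 1]
                if inside and max(inside) <= internal_cut_max:
                    new_words.add(w)
                pos = sent.find(w, pos + 1)
    return new_words


if __name__ == "__main__":
    # Toy corpus: with the uniform placeholder every candidate passes the
    # filter; a trained LSTM-CRF would prune the noise words.
    sentences = ["明月几时有", "把酒问青天", "明月几时有"]
    candidates = candidate_ngrams("".join(sentences), min_count=2)
    print(filter_candidates(candidates, sentences))
```

In the paper, the n-gram counting is distributed across Spark workers and the boundary probabilities come from a trained LSTM-CRF segmenter; both are replaced here by in-memory stand-ins.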
Authors
刘昱彤
吴斌
谢韬
王柏
LIU Yutong; WU Bin; XIE Tao; WANG Bai (Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2019, No. 1, pp. 46-55 (10 pages)
Journal of Chinese Information Processing
Funding
国家"973"重点基础研究发展计划(2013CB329606)
国家自然科学基金(61772082)
国家社会科学基金(16ZDA055)