Abstract
Based on an analysis of the hierarchical nature of phrase-structure parsing and the structural characteristics of Chinese, this paper proposes a syntactic structure tree in which each chunk is replaced by its core word and chunks are combined layer by layer. During syntactic boundary analysis, chunk recognition and core word extraction are performed separately. In the chunk recognition module, a model combining Bi-directional Long Short-Term Memory (BiLSTM) with a Conditional Random Field (CRF), denoted BiLSTM+CRF, identifies chunk boundary tags: the BiLSTM learns contextual features, while the CRF learns the transition features of the output tag sequence, so that the predicted tag sequence is decoded jointly. In the core word extraction module, the TextRank importance ranking algorithm is improved with Word2vec word vectors, and recognition accuracy is raised by adding word similarity, position, and part-of-speech information. The experiments compare the syntactic boundary analysis performance of CRF, BiLSTM, and BiLSTM+CRF chunk recognition, each paired with TextRank core word recognition using the three kinds of information, and also compare how each model performs at different sentence lengths. The results show that BiLSTM+CRF combined with the improved TextRank performs best: relative to the baseline LR method, the F1 score increases by 6.58 percentage points and the whole-sentence accuracy by 3.68 percentage points, which verifies the effectiveness and stability of the model.
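The abstract does not give the formula for the improved TextRank, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that edge weights are the cosine similarity of word vectors for words co-occurring within a window, and that position and part-of-speech information enter as a per-word prior in the teleport term. The function names, the `prior` argument, and the toy vectors are all hypothetical; in practice the vectors would come from a trained Word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_textrank(words, vectors, prior, window=2, damping=0.85, iters=50):
    """Score each word of one non-empty chunk; returns a list aligned with `words`.

    Assumptions (not taken from the paper):
      - edge weight = non-negative cosine similarity of co-occurring words
      - `prior` holds positive per-word weights (e.g. from position and POS)
        and replaces the uniform teleport term of plain TextRank
    """
    n = len(words)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            sim = max(cosine(vectors[words[i]], vectors[words[j]]), 0.0)
            w[i][j] = w[j][i] = sim
    out_sum = [sum(row) for row in w]
    total = sum(prior)
    p = [x / total for x in prior]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            incoming = sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) * p[i] + damping * incoming)
        scores = new_scores
    return scores

def core_word(words, vectors, prior):
    """Return the highest-scoring word of the chunk."""
    scores = weighted_textrank(words, vectors, prior)
    best = max(range(len(words)), key=scores.__getitem__)
    return words[best]
```

For example, with the toy chunk `["the", "red", "apple"]`, two-dimensional vectors in which `apple` is similar to both other words, and a prior that favours the last word, `core_word` picks `apple`. How the three kinds of information are actually weighted against each other is a design choice that the abstract leaves open.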
Authors
杨陈菊
邵玉斌
孙俊
龙华
皮乾东
YANG Chen-ju; SHAO Yu-bin; SUN Jun; LONG Hua; PI Qian-dong (College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; National Key Laboratory of Computer Science of Yunnan Province, Kunming University of Science and Technology, Kunming 650500, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2022, Issue 7, pp. 1394-1400 (7 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61761025).
Keywords
chunk recognition
core word extraction
conditional random field
bi-directional long short-term memory model
TextRank