Abstract
A sub-character feature embedding, called radical embedding, is proposed and applied to a long short-term memory (LSTM) model for sentence segmentation of pre-modern Chinese texts. The dataset comprises over 150 classical Chinese books from three dynasties and covers a range of literary styles. The LSTM-conditional random fields (LSTM-CRF) model is a state-of-the-art method for sequence labeling. Adding a radical embedding component to this model improves performance. Experimental results on these books show better accuracy than earlier methods on sentence segmentation, especially on Tang dynasty epitaph texts (achieving an F1-score of 81.34%).
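As a rough illustration of how sentence segmentation can be treated as a sequence labeling problem, the sketch below turns a punctuated sentence into per-character labels and looks up a radical for each character. The E/O label set, the punctuation list, the tiny radical table, and the function names are all assumptions made for this example; the abstract does not specify the tagging scheme or radical dictionary actually used.

```python
# Illustrative sketch only. The label scheme ("E" = a break follows this
# character, "O" = no break), the punctuation set, and the radical table are
# assumptions for this example, not details taken from the paper.
PUNCTUATION = set("，。？！；：、")

# Tiny hypothetical character-to-radical table; a real system would need a
# full radical dictionary.
RADICALS = {"江": "氵", "河": "氵", "松": "木", "柏": "木"}


def to_labeled_sequence(punctuated):
    """Strip punctuation and label each remaining character with 'E' or 'O'."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if labels:  # mark the character just before the punctuation mark
                labels[-1] = "E"
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def radical_features(chars, unknown="<UNK>"):
    """Map each character to its radical, or to a placeholder if not in the table."""
    return [RADICALS.get(ch, unknown) for ch in chars]


chars, labels = to_labeled_sequence("學而時習之，不亦說乎。")
print("".join(chars))   # unpunctuated input, as a segmentation model would see it
print(labels)           # target labels for training
print(radical_features(list("江河松山")))
```

In a setup like the one the abstract describes, each radical would presumably be mapped to a learned vector and supplied to the LSTM alongside the character representation, so that characters sharing a radical (such as 江 and 河) share part of their input features.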
Funding
Supported by the Fund of the Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content (ZD2018-07/05).