Sentence segmentation for classical Chinese based on LSTM with radical embedding (Cited by: 7)

Abstract: A sub-character feature embedding, called radical embedding, is proposed and applied to a long short-term memory (LSTM) model for sentence segmentation of pre-modern Chinese texts. The dataset includes over 150 classical Chinese books from 3 different dynasties and covers different literary styles. The LSTM-conditional random fields (LSTM-CRF) model is a state-of-the-art method for the sequence labeling problem. The proposed model adds a radical embedding component, which improves performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang epitaph texts (achieving an F1-score of 81.34%).
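The abstract treats sentence segmentation as a sequence labeling problem. A minimal sketch of that framing is given below; it does not reproduce the LSTM-CRF model or the radical embedding. The label scheme ("B" for a character followed by a break, "O" otherwise), the punctuation set, and the function names are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# Assumed punctuation set marking segment boundaries (illustrative only).
PUNCTUATION = set("，。；：？！、")


def to_labeled_sequence(punctuated: str) -> List[Tuple[str, str]]:
    """Strip punctuation and label each remaining character.

    Hypothetical scheme: "B" means a break follows the character,
    "O" means no break follows it.
    """
    labeled: List[Tuple[str, str]] = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            # Mark the preceding character as the end of a segment.
            if labeled:
                labeled[-1] = (labeled[-1][0], "B")
        else:
            labeled.append((ch, "O"))
    return labeled


def restore_breaks(chars: List[str], labels: List[str], mark: str = "／") -> str:
    """Rebuild segmented text from characters and predicted labels."""
    out: List[str] = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == "B":
            out.append(mark)
    return "".join(out)


pairs = to_labeled_sequence("學而時習之，不亦說乎。")
chars = [c for c, _ in pairs]
labels = [l for _, l in pairs]
print(restore_breaks(chars, labels))
```

In this framing, a sequence labeling model would be trained to predict the label list from the unpunctuated character list, and `restore_breaks` would turn its predictions back into segmented text.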

Authors: Han Xu, Wang Hongsu, Zhang Sanqian, Fu Qunchao, Liu Jun

Affiliations: School of Software Engineering; Key Laboratory of Trustworthy Distributed Computing and Service; The Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content; Institute of Quantitative Social Science; Department of Statistics

Source: The Journal of China Universities of Posts and Telecommunications (中国邮电高校学报（英文版）), indexed in EI and CSCD, 2019, No. 2, pp. 1-8 (8 pages)

Funding: Supported by the Fund of the Key Laboratory of Rich-Media Knowledge Organization and Service of Digital Publishing Content (ZD2018-07/05)

Keywords: LSTM; radical embedding; sentence segmentation

Classification code: TN [Electronics and Telecommunications]
