
一种基于层叠CRF的古文断句与句读标记方法 (cited by 9)

Method of sentence segmentation and punctuating for ancient Chinese literatures based on cascaded CRF
Abstract: Data sparseness is the primary challenge in applying natural language processing techniques to sentence segmentation and punctuation marking of ancient Chinese texts. To address it, the authors design a six-character-position tag set and propose a segmentation and punctuation method based on cascaded conditional random fields (CRF). Using the six-position tag set, a low-level model determines sentence boundaries from the observation sequence, and a high-level model assigns punctuation marks using both the observation sequence and the sentence-boundary information produced by the low-level model. Closed and open tests were carried out on a mixed ancient Chinese corpus of approximately 5M. In the closed test, the F-measures for sentence segmentation and punctuation marking were 96.48% and 91.35%, respectively; in the open test, they were 71.42% and 67.67%, respectively.
Source: 《计算机应用研究》 (Application Research of Computers), CSCD, 北大核心 (Peking University Core Journal), 2009, No. 9, pp. 3326-3329 (4 pages)
Funding: Key Science and Technology Project of the Henan Provincial Department of Science and Technology (0624480021)
Keywords: ancient Chinese; cascaded conditional random fields (cascaded CRF); data sparseness; sentence segmentation; punctuation marking
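
The cascade described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not list the six tags, so the sketch assumes the B, B2, B3, M, E, S position tags familiar from CRF-based Chinese word segmentation. The function names, the feature templates, and the "O" label for "no mark after this character" are hypothetical. No CRF is trained here; the code only shows how punctuated text could be turned into training rows and how the high-level features could combine the observation sequence with the low-level boundary tags.

```python
# Illustrative sketch; tag names, labels, and feature templates are assumptions.
PUNCTUATION = set(",。?!;:、")


def six_tags(n):
    """Position tags for a clause of n characters (assumed set: B, B2, B3, M, E, S)."""
    if n <= 0:
        return []
    if n == 1:
        return ["S"]
    head = ["B", "B2", "B3"][: n - 1]      # up to three clause-initial positions
    middle = ["M"] * (n - 1 - len(head))   # interior characters
    return head + middle + ["E"]           # clause-final character


def to_training_rows(punctuated):
    """Turn punctuated text into (char, boundary_tag, punct_label) rows."""
    rows, clause = [], []
    for ch in punctuated:
        if ch not in PUNCTUATION:
            clause.append(ch)
        elif clause:
            # Only the clause-final character carries the mark; "O" means none.
            labels = ["O"] * (len(clause) - 1) + [ch]
            rows.extend(zip(clause, six_tags(len(clause)), labels))
            clause = []
    if clause:  # trailing text without a closing mark
        rows.extend(zip(clause, six_tags(len(clause)), ["O"] * len(clause)))
    return rows


def low_level_features(chars, i):
    """Observation-only features for the boundary (segmentation) model."""
    prev_ch = chars[i - 1] if i > 0 else "<s>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return {
        "c0": chars[i],
        "c-1": prev_ch,
        "c+1": next_ch,
        "c-1c0": prev_ch + chars[i],
        "c0c+1": chars[i] + next_ch,
    }


def high_level_features(chars, boundary_tags, i):
    """Observation features plus the low-level model's boundary tag."""
    feats = low_level_features(chars, i)
    feats["tag0"] = boundary_tags[i]
    feats["c0|tag0"] = chars[i] + "|" + boundary_tags[i]
    return feats


rows = to_training_rows("学而时习之,不亦说乎?")
chars = [c for c, _, _ in rows]
boundary_tags = [t for _, t, _ in rows]  # gold tags stand in for low-level predictions
punct_labels = [p for _, _, p in rows]
X_high = [high_level_features(chars, boundary_tags, i) for i in range(len(chars))]
print(list(zip(chars, boundary_tags, punct_labels)))
```

In an actual cascade the `boundary_tags` passed to `high_level_features` would be the low-level model's predictions rather than gold tags. The per-character feature dictionaries could then be handed to any CRF toolkit that accepts that format.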


