Chinese Lexical Analysis Using Cascaded Hidden Markov Model (cited by: 198)
Abstract: This paper presents an approach to Chinese lexical analysis based on a cascaded hidden Markov model (CHMM), which aims to integrate Chinese word segmentation, part-of-speech tagging, segmentation disambiguation, and unknown word recognition into a single theoretical framework. Word segmentation uses a class-based HMM in which unknown words are treated in the same way as ordinary words listed in the lexicon. Unknown word recognition uses a role-based HMM: the Viterbi algorithm tags the globally optimal role sequence, unknown words are then identified from that role sequence, and a reliability score is computed for each of them. For disambiguation, the authors propose an N-shortest-path strategy: at an early stage the top N segmentation results are retained as a candidate set, so as to cover as many ambiguous fragments as possible, and the final result is selected from these N candidates after unknown word recognition and part-of-speech tagging. Experiments at several levels show that each level of the CHMM contributes to Chinese lexical analysis. The approach is implemented in the ICTCLAS system, which ranked first in the official open evaluation held by the "973" project expert group in 2002 and achieved two first places and one second place in the first international Chinese word segmentation bakeoff organized by SIGHAN (the ACL Special Interest Group on Chinese Language Processing) in 2003. The authors take these results as evidence that ICTCLAS is one of the best Chinese lexical analyzers and that CHMM is effective for Chinese lexical analysis.
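The N-shortest-path idea in the abstract (keep the N best segmentations early, choose among them later) can be illustrated on a small word lattice. The following is a minimal sketch under stated assumptions, not the ICTCLAS implementation: the function name `n_shortest_segmentations`, the toy lexicon, its costs, and the `unknown_cost` fallback are all hypothetical, and the abstract does not specify how edges are weighted or how ties are handled.

```python
import heapq


def n_shortest_segmentations(sentence, lexicon, n=2, unknown_cost=10.0):
    """Return up to n lowest-cost segmentations of `sentence`.

    `lexicon` maps a word to a cost (for example a negative log probability).
    A single character missing from the lexicon gets `unknown_cost`, so the
    lattice always has at least one complete path.
    """
    length = len(sentence)
    # best[i] holds up to n (cost, words) pairs covering sentence[:i].
    best = [[] for _ in range(length + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, length + 1):
        candidates = []
        for start in range(end):
            word = sentence[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue  # multi-character string that is not a known word
            for prev_cost, prev_words in best[start]:
                candidates.append((prev_cost + cost, prev_words + [word]))
        # Keeping the n cheapest partial paths per node is enough to
        # recover the n cheapest complete paths in a lattice (a DAG).
        best[end] = heapq.nsmallest(n, candidates, key=lambda c: c[0])
    return best[length]


# Hypothetical costs, chosen only so that the two readings differ.
toy_lexicon = {
    "研究": 2.0, "研究生": 2.5, "生命": 2.0, "起源": 2.0,
    "研": 5.0, "究": 5.0, "生": 4.0, "命": 4.0, "起": 5.0, "源": 5.0,
}

for cost, words in n_shortest_segmentations("研究生命起源", toy_lexicon, n=2):
    print(cost, "/".join(words))
# 6.0 研究/生命/起源
# 8.5 研究生/命/起源
```

Both readings of the ambiguous string survive in the candidate set, so a later stage (in the paper, unknown word recognition and part-of-speech tagging) can still prefer the second one if the context supports it.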
Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, and the Peking University Core Journals list), 2004, No. 8, pp. 1421-1429 (9 pages)
Funding: National "973" Key Basic Research Development Program (G19980305074, G1998030510); Frontier Youth Fund of the Institute of Computing Technology, Chinese Academy of Sciences (2002618023)
Keywords: Chinese lexical analysis; word segmentation; POS tagging; unknown word recognition; cascaded hidden Markov model; ICTCLAS
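The role-tagging step described in the abstract (Viterbi decoding of a role sequence, followed by recognition of unknown words from that sequence) can be illustrated with a toy example. This is a minimal sketch, not the ICTCLAS implementation: the three roles (`SURNAME`, `GIVEN`, `OTHER`), every probability, and the helper `extract_names` are hypothetical and exist only to make the example run.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return (log_prob, best_state_sequence) for a discrete HMM."""
    def lg(p):
        # Missing or zero probabilities are floored to keep logs finite.
        return math.log(max(p, floor))

    # scores[s] = best log probability of any path ending in state s.
    scores = {s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(observations[0], 0.0))
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + lg(trans_p[p].get(s, 0.0)))
            new_scores[s] = (scores[prev] + lg(trans_p[prev].get(s, 0.0))
                             + lg(emit_p[s].get(obs, 0.0)))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


def extract_names(tokens, roles):
    """Collect SURNAME followed by one or more GIVEN tokens (hypothetical rule)."""
    names, current = [], []
    for token, role in zip(tokens, roles):
        if role == "GIVEN" and current:
            current.append(token)
            continue
        if len(current) > 1:
            names.append("".join(current))
        current = [token] if role == "SURNAME" else []
    if len(current) > 1:
        names.append("".join(current))
    return names


# Hypothetical role set and probabilities (not taken from the paper).
ROLES = ["SURNAME", "GIVEN", "OTHER"]
start_p = {"SURNAME": 0.3, "OTHER": 0.7}
trans_p = {
    "SURNAME": {"GIVEN": 0.9, "OTHER": 0.1},
    "GIVEN": {"GIVEN": 0.5, "OTHER": 0.5},
    "OTHER": {"SURNAME": 0.2, "OTHER": 0.8},
}
emit_p = {
    "SURNAME": {"张": 0.6, "华": 0.1},
    "GIVEN": {"华": 0.3, "平": 0.3},
    "OTHER": {"张": 0.01, "华": 0.05, "平": 0.05, "说": 0.4},
}

tokens = ["张", "华", "平", "说"]
log_prob, roles = viterbi(tokens, ROLES, start_p, trans_p, emit_p)
print(roles)                         # ['SURNAME', 'GIVEN', 'GIVEN', 'OTHER']
print(extract_names(tokens, roles))  # ['张华平']
```

The path probability `math.exp(log_prob)` is one simple stand-in for the reliability score the abstract mentions; how the paper actually computes that score is not stated here.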

References (27)

  • 1. Liang Nanyuan. CDWS: an automatic word segmentation system for written Chinese [J]. Journal of Chinese Information Processing, 1987, (2): 44-52 (in Chinese).
  • 2. Zhang Huaping, Liu Qun. A coarse Chinese word segmentation model based on the N-shortest-paths method [J]. Journal of Chinese Information Processing, 2002, 16(5): 1-7 (in Chinese). Cited by: 99
  • 3. J Hockenmaier, C Brew. Error-driven learning of Chinese word segmentation. In: J Guo, K T Lua, J Xu, eds. The 12th Pacific Conf on Language and Information, Singapore, 1998.
  • 4. Andi Wu, Zixin Jiang. Word segmentation in sentence analysis. 1998 Int'l Conf on Chinese Information Processing, Beijing, 1998.
  • 5. D Palmer. A trainable rule-based algorithm for word segmentation. The 35th Annual Meeting of the Association for Computational Linguistics (ACL'97), Madrid, 1997.
  • 6. Y Dai, C S G Khoo, T E Loh. A new statistical formula for Chinese text segmentation incorporating contextual information. ACM SIGIR'99, Berkeley, 1999.
  • 7. Gao Shan, Zhang Yan, et al. The research on integrated Chinese word segmentation and labeling based on trigram statistical model. In: Natural Language Understanding and Machine Translation. Beijing: Tsinghua University Press, 2001, 116-122 (in Chinese).
  • 8. F Peng, D Schuurmans. A hierarchical EM approach to word segmentation. The 6th Natural Language Processing Pacific Rim Symposium (NLPRS-2001), Tokyo, 2001.
  • 9. W J Teahan, Y Wen, R McNab, et al. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 2000, 26(3): 375-393.
  • 10. Nianwen Xue, Susan P Converse. Combining classifiers for Chinese word segmentation. First SIGHAN Workshop Attached with the 19th COLING, Taipei, 2002.

Second-level references (12)

  • 1. Sun Maosong, Huang Changning, Gao Haiyan, Fang Jie. Automatic identification of Chinese personal names [J]. Journal of Chinese Information Processing, 1995, 9(2): 16-27 (in Chinese). Cited by: 87
  • 2. Zhou Qiang. A Chinese part-of-speech tagging method combining rules and statistics [J]. Journal of Chinese Information Processing, 1995, 9(3): 1-10 (in Chinese). Cited by: 43
  • 3. Luo Zhiyong, Song Rou. An integrated and fast method for recognizing proper nouns in automatic segmentation of modern Chinese [C] // Ji Dong-Hong. International Conference on Chinese Computing, Singapore, 2001: 323-328 (in Chinese).
  • 4. Ji Heng, Luo Zhen-Sheng. Inverse name frequency model and rules based on Chinese name identifying. In: Huang Chang-Ning, Zhang Pu, eds. Natural Language Understanding and Machine Translation. Beijing: Tsinghua University Press, 2001, 123-128 (in Chinese).
  • 5. Zheng Jia-Heng, Liu Kai-Ying. Discussion on strategy of surname and personal name processing in Chinese word segmentation. In: Chen Li-Wei, ed. Research and Application of Computational Linguistics. Beijing: Beijing Language Institute Press, 1993 (in Chinese).
  • 6. Song Rou, Zhu Hong, et al. Approach of personal name recognition based on corpus and rules. In: Chen Li-Wei, ed. Research and Application of Computational Linguistics. Beijing: Beijing Language Institute Press, 1993 (in Chinese).
  • 7. Wang Sheng, Huang De-Gen, Yang Yuan-Sheng. Chinese person name recognition based on mixture of statistics and rules. In: Huang Chang-Ning, Dong Zhen-Dong, eds. Corpora of Computational Linguistics. Beijing: Tsinghua University Press, 1999 (in Chinese).
  • 8. Chen Xiao-He. Automatic Analysis of Modern Chinese. Beijing: Beijing Language and Culture University Press, 2000, 104-114 (in Chinese).
  • 9. L R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989, 77(2): 257-286.
  • 10. L R Rabiner, B H Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 1986, 3(1): 4-16.

Documents sharing references (240)

Co-cited documents (1740)

Citing documents (198)

Second-level citing documents (1386)
