Context Information and Fragments Based Cross-Domain Word Segmentation (cited by: 8)

Abstract: A new joint decoding strategy that combines character-based and word-based conditional random field (CRF) models is proposed. In this segmentation framework, fragments are used to generate candidate out-of-vocabulary words (OOVs). After the initial segmentation, the segmentation fragments are divided into two classes: "combination" (combining several fragments into an unknown word) and "segregation" (splitting the fragments into separate words), so that more OOVs can be recalled. Moreover, to suit the characteristics of cross-domain segmentation, context information is used to guide Chinese Word Segmentation (CWS). Experiments on test data from the SIGHAN Bakeoff 2007 and Bakeoff 2010 show that the method is effective: OOV recall improves and overall segmentation performance is good.
Affiliation: Dalian University of Technology
Source: China Communications (中国通信, English edition), SCIE, CSCD, 2012, No. 3, pp. 49-57 (9 pages)
Funding: Supported by the National Natural Science Foundation of China under Grants No. 61173100 and No. 61173101, and by the Fundamental Research Funds for the Central Universities under Grant No. DUT10RW202.
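Keywords: context information; word segmentation; decoding strategy; reasonable use; test data; out-of-vocabulary words; recall rate; segmentation; cross-domain CWS; Conditional Random Fields (CRFs); joint decoding; context variables; segmentation fragments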
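
The "combination" step mentioned in the abstract can be illustrated with a toy sketch. The abstract does not say how fragments are identified or scored, so everything here is an assumption made for illustration: a fragment is taken to be a single-character token missing from a lexicon, and the names `fragment_runs`, `candidate_oovs`, and `max_len` are hypothetical. It only enumerates candidate OOVs from an initial segmentation; it is not the paper's CRF-based joint decoder.

```python
from typing import List, Set, Tuple


def fragment_runs(tokens: List[str], lexicon: Set[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of maximal runs of fragment tokens.

    Assumption (not from the paper): a fragment is a single-character
    token that does not appear in the lexicon.
    """
    runs = []
    start = None
    for i, tok in enumerate(tokens):
        is_fragment = len(tok) == 1 and tok not in lexicon
        if is_fragment and start is None:
            start = i                      # a new run begins here
        elif not is_fragment and start is not None:
            runs.append((start, i))        # the run ended before token i
            start = None
    if start is not None:
        runs.append((start, len(tokens)))  # run reaches the end of the sentence
    return runs


def candidate_oovs(tokens: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Enumerate "combination" candidates: every contiguous span of
    2..max_len fragments inside a run, joined into one string."""
    candidates = []
    for start, end in fragment_runs(tokens, lexicon):
        for i in range(start, end):
            for j in range(i + 2, min(end, i + max_len) + 1):
                candidates.append("".join(tokens[i:j]))
    return candidates


# Hypothetical initial segmentation and lexicon.
tokens = ["我", "在", "大", "连", "理", "工", "学习"]
lexicon = {"我", "在", "学习"}
print(candidate_oovs(tokens, lexicon))
```

In a full system each candidate would then be scored, for example by a word-based model, to decide between combining the fragments and leaving them as separate words. That scoring step is omitted here because the abstract does not specify it.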