
Automatic Labeling of Chinese Functional Chunks Based on Conditional Random Fields Model
Cited by: 7
Abstract: In the Chinese chunking scheme, words are first combined into base chunks, base chunks are then combined into functional chunks, and the result is a hierarchical syntactic description of the sentence. This paper treats automatic labeling of Chinese functional chunks as a sequence labeling task and builds separate labeling models that take words and base chunks, respectively, as the labeling units. For each model, a feature set at the base-chunk level is constructed, and a conditional random fields (CRF) model performs the labeling. The experimental data come from the Tsinghua Chinese Treebank (TCT) corpus, split 8:2 into training and test sets. The results show that, compared with a model that uses only word-level features, adding suitable base-chunk features significantly improves functional chunk labeling. With human-annotated base chunks, the method achieves a precision of 88.47%, recall of 89.93%, and F-measure of 89.19%. With automatically labeled base chunks, it achieves a precision of 84.27%, recall of 85.57%, and F-measure of 84.92%.
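The reported F-measures can be checked against the reported precision and recall. The sketch below assumes the standard balanced F-measure (the harmonic mean of precision and recall); the abstract does not state which formula was used, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall).

    Assumed definition: the abstract reports F values but not the formula.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
gold = f_measure(88.47, 89.93)  # human-annotated base chunks
auto = f_measure(84.27, 85.57)  # automatically labeled base chunks

print(f"human-annotated base chunks:  F = {gold:.2f}")
print(f"automatically labeled chunks: F = {auto:.2f}")
```

Under this assumption, both computed values agree with the reported 89.19% and 84.92% to within rounding of the published precision and recall figures.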
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 2, pp. 336-343 (8 pages)
Funding: National Natural Science Foundation of China (60873128); Shanxi Province Science and Technology Research Program (2007031126_01)
Keywords: Chinese base chunk; Chinese functional chunk; conditional random fields; syntactic parsing; sequence labeling

