Automatic learning and refinement algorithm for Chinese base chunk rules (汉语基本块规则的自动学习和扩展进化)

Cited by: 6
Abstract: A method is presented for automatically acquiring hierarchical, multi-granularity rule knowledge for Chinese multiword base chunks, supported by a large annotated corpus and a lexical knowledge base. The system first extracts all part-of-speech-based base chunk rules from the annotated corpus and uses a rule confidence score to prune the many unreliable or useless rules. High-frequency rules with low reliability are then expanded step by step with internal lexical constraints and external contextual restrictions, so that they evolve into structured rules with stronger descriptive capability. An expected accuracy index is proposed to evaluate objectively the descriptive capability of the learned rule base. Experimental results show that the algorithm covers 93% of the annotated positive examples with about 16% effective expanded rules and raises the expected accuracy from 51% to 81%, which indicates that the rule learning and evaluation method is effective.
Author: Zhou Qiang (周强)
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2008, No. 1, pp. 88-91 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (60573185, 60520130299)
Keywords: information processing; rule knowledge acquisition; base chunk; confidence score analysis; restriction-based refinement; rule base evaluation
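
The first stage described in the abstract (extracting part-of-speech-based rules and pruning them by confidence) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not give the confidence formula, so the sketch assumes confidence is the fraction of a tag sequence's occurrences that are annotated as base chunks. The function names, the toy tag set, the span length limit, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

def tag_spans(tags, max_len):
    """Yield (start, end, tag_tuple) for every span of 2..max_len tags."""
    for i in range(len(tags)):
        for j in range(i + 2, min(i + max_len, len(tags)) + 1):
            yield i, j, tuple(tags[i:j])

def rule_confidence(corpus, max_len=3):
    """Return {tag_sequence: confidence} for sequences seen as a chunk at least once.

    corpus: list of (pos_tags, chunk_spans), where chunk_spans is a set of
    (start, end) index pairs marking annotated base chunks.
    Assumed formula: confidence = chunk occurrences / all occurrences.
    """
    positive, total = Counter(), Counter()
    for tags, spans in corpus:
        for i, j, seq in tag_spans(tags, max_len):
            total[seq] += 1
            if (i, j) in spans:
                positive[seq] += 1
    return {seq: positive[seq] / total[seq] for seq in positive}

def split_by_threshold(confidence, threshold=0.5):
    """Separate reliable rules from low-confidence ones (refinement candidates)."""
    kept = {r: c for r, c in confidence.items() if c >= threshold}
    low = {r: c for r, c in confidence.items() if c < threshold}
    return kept, low

# Toy annotated corpus: POS tag sequences plus annotated chunk spans.
corpus = [
    (["a", "n", "v", "n"], {(0, 2)}),
    (["a", "n", "v"], {(0, 2)}),
    (["v", "n", "n"], {(1, 3)}),
    (["v", "n", "d"], {(0, 2)}),
]
confidence = rule_confidence(corpus)
kept, low = split_by_threshold(confidence)
```

In this toy corpus the sequence ("v", "n") is a chunk in only one of its three occurrences, so it falls into the low-confidence group. In the method described above, frequent rules of that kind are the ones expanded with lexical and contextual constraints rather than simply discarded.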