Research on the Technology of Building a Kazakh Treebank Based on Cascaded Conditional Random Fields
Abstract: To improve the processing performance of statistical syntactic parsing for Kazakh, this paper proposes a method for building a Kazakh treebank through human-computer interaction. The automatic syntactic annotation stage uses a cascaded conditional random field (CRF) model. An improved transformation-based error-driven learning algorithm is inserted between the low-level and high-level models, so that simple sentences are annotated and then corrected automatically. Finally, the remaining special whole-label errors are proofread manually, yielding a phrase-structure Kazakh treebank. Experimental results show that the method substantially reduces the manpower and material resources required, improves parsing accuracy and overall processing efficiency, and lays a foundation for later work on syntax-based Kazakh machine translation and text mining.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, 2016, No. 3, pp. 71-75, 82 (6 pages)
Funding: National Natural Science Foundation of China (61063025, 61363062)
Keywords: Kazakh treebank; human-computer interaction; cascaded conditional random fields; error-driven learning algorithm
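The abstract describes a correction step placed between the low-level and high-level models. The following is a minimal sketch of plain transformation-based error-driven learning in the style of Brill, not the authors' improved variant, whose details are not given here. It assumes the low-level model outputs one tag per token and that the only rule template is "change tag A to B when the previous tag is C". The tag names (`ADJ`, `V`, `N`) and all function names are hypothetical.

```python
from collections import Counter


def apply_rule(tags, rule):
    """Rewrite tag `frm` to `to` wherever the previous tag equals `prev`."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the input sequence, not the partly rewritten one.
        if tags[i] == frm and tags[i - 1] == prev:
            out[i] = to
    return out


def errors(pred, gold):
    """Count positions where the predicted tag differs from the gold tag."""
    return sum(p != g for p, g in zip(pred, gold))


def learn_rules(pred_seqs, gold_seqs, max_rules=10):
    """Greedily pick the rule that removes the most errors, apply it, and repeat."""
    current = [list(s) for s in pred_seqs]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only from positions that are currently wrong.
        candidates = Counter()
        for pred, gold in zip(current, gold_seqs):
            for i in range(1, len(pred)):
                if pred[i] != gold[i]:
                    candidates[(pred[i], gold[i], pred[i - 1])] += 1
        best_rule, best_gain = None, 0
        for rule in candidates:
            # Net gain: errors fixed minus errors introduced over the whole training set.
            gain = sum(
                errors(p, g) - errors(apply_rule(p, rule), g)
                for p, g in zip(current, gold_seqs)
            )
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no remaining rule reduces the error count
        rules.append(best_rule)
        current = [apply_rule(p, best_rule) for p in current]
    return rules


def correct(tags, rules):
    """Apply the learned rules in the order they were learned."""
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Toy data (hypothetical tags): the low-level output mislabels a word after ADJ as V.
low_level_output = [["ADJ", "V", "N"], ["N", "V"]]
gold = [["ADJ", "N", "N"], ["N", "V"]]
learned = learn_rules(low_level_output, gold)
corrected = correct(["ADJ", "V", "V"], learned)
```

In a cascaded setup, the corrected sequence would then be passed as input features to the high-level model, so errors from the first layer are less likely to propagate to the second.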