A Study in Dictionary-Based All-word Word Sense Disambiguation for Pre-Qin Chinese
(基于词典信息的先秦汉语全文词义标注方法研究)
Abstract: Word sense disambiguation (WSD) is a basic task in natural language processing, and the computational processing of ancient Chinese is in urgent need of deep semantic annotation. This paper targets pre-Qin Chinese, a special kind of language material for which training corpora and semantic resources are scarce. We take Hanyu Da Cidian 2.0 (《汉语大词典2.0》) as the knowledge source, treating its sense definitions as sense classes and the example sentences under each sense as training data, and apply a semi-supervised method based on support vector machines (SVM) to sense-tag the full text of the Zuo Zhuan (《左传》). For manual evaluation we randomly selected 22 words covering different frequencies and numbers of senses; the average accuracy is 67%, significantly higher than the baseline of always choosing the most frequent sense. The method can be applied broadly to sense tagging of ancient Chinese where no training data are available: it supplies initial results at the early stage of all-word sense tagging of ancient Chinese, provides a sound starting point for manual sense annotation, helps supplement incomplete definitions in traditional dictionaries, and further enriches the materials for research on the history of the Chinese language.
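The abstract only outlines the pipeline: dictionary senses serve as class labels, the example sentences listed under each sense serve as training data, and an SVM assigns a sense to each occurrence of the target word in the corpus. The sketch below is a minimal illustration of that core idea, not the authors' implementation: it assumes scikit-learn, uses toy placeholder dictionary entries and corpus sentences, relies on simple character-window features, and omits the semi-supervised extension described in the paper; all function and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for Hanyu Da Cidian entries: each target word maps every sense
# (gloss) to the example sentences the dictionary lists under that sense.
# The entries below are illustrative placeholders, not actual dictionary text.
dictionary = {
    "行": {
        "to walk; to travel": ["三人行必有我师", "行百里者半九十"],
        "conduct; behaviour": ["听其言而观其行", "言必信行必果"],
    },
}

def context_features(sentence, target, window=5):
    """Represent an occurrence by the characters within a window around it."""
    i = sentence.find(target)
    if i < 0:
        return sentence
    return sentence[max(0, i - window):i] + sentence[i + len(target):i + len(target) + window]

def train_sense_classifier(word):
    """Train one SVM per headword, using dictionary example sentences as training data."""
    texts, labels = [], []
    for sense, examples in dictionary[word].items():
        for example in examples:
            texts.append(context_features(example, word))
            labels.append(sense)
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)),
        LinearSVC(),
    )
    model.fit(texts, labels)
    return model

def tag_occurrences(word, corpus_sentences):
    """Assign a dictionary sense to every corpus sentence containing `word`."""
    model = train_sense_classifier(word)
    return [(s, model.predict([context_features(s, word)])[0])
            for s in corpus_sentences if word in s]

if __name__ == "__main__":
    # Hypothetical sentences standing in for the Zuo Zhuan text.
    sample = ["宫之奇谏不听遂行", "观其行而知其人"]
    for sentence, sense in tag_occurrences("行", sample):
        print(sentence, "->", sense)
```

In an all-word setting, one such classifier would be built for every ambiguous headword in the dictionary and run over every sentence of the corpus; the character-window features here are just one plausible choice of context representation.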
Source: Journal of Chinese Information Processing (《中文信息学报》, CSCD, Peking University core journal), 2012, No. 3, pp. 65-71, 103 (8 pages).
Funding: Funded Project on Lexical Knowledge Mining of Pre-Qin Documents (2010JDXM023); "211" Project "Statistics and Knowledge Retrieval of Pre-Qin Chinese Vocabulary" (先秦汉语词汇统计与知识检索); National Social Science Fund of China (10&ZD117, 10CYY021, 08BYY054).
Keywords: word sense disambiguation; sense tagging; ancient Chinese; natural language processing