Abstract
Acquiring word sense knowledge is the foundation and starting point for tasks such as lexical knowledge base construction and word sense disambiguation. At present this work relies almost entirely on the intelligence and insight of human experts, so it lacks objectivity and consistency in meaning computation when applied to large-scale text processing. Taking mid- and high-frequency Chinese adjectives as the sample, this paper explores word sense features in depth and applies an EM iterative algorithm with a parameter initialization step to automatically discover and discriminate word senses in real text. The algorithm uses easily obtained word-form (Chinese character) features, collocation (contextual bag-of-words) features drawn from a large-scale corpus, and attribute-host relation features drawn from web corpora, in place of the syntactic structure features used previously, which are hard to obtain and less reliable. HowNet is further used to optimize the selection of word-form features. The experimental results show that the induced sense distinctions differ from those in existing Chinese lexicons, so they can be used to revise and supplement those lexicons. The work can also be applied in areas such as information retrieval, and the approach can be extended to other Chinese word classes.
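The abstract does not give the model details, so the following is only a minimal sketch of how EM-based sense discrimination with an initialization step might look. It assumes a multinomial mixture over bag-of-features contexts and a seed-based initialization. The function name `em_sense_discrimination`, the `seeds` argument, and the smoothing constant `alpha` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def em_sense_discrimination(contexts, seeds, n_iter=20, alpha=0.1):
    """Cluster contexts of one target word into len(seeds) senses.

    contexts: list of feature lists, one per occurrence of the target word.
    seeds:    list of feature sets, one per sense, used only to initialise
              the soft assignments (an assumed initialisation scheme).
    """
    k = len(seeds)
    vocab = sorted({f for ctx in contexts for f in ctx})
    # Initialisation: responsibility proportional to 1 + number of seed hits.
    resp = []
    for ctx in contexts:
        w = [1.0 + sum(f in s for f in ctx) for s in seeds]
        z = sum(w)
        resp.append([x / z for x in w])

    for _ in range(n_iter):
        # M-step: re-estimate sense priors and smoothed feature distributions.
        prior = [sum(r[j] for r in resp) / len(contexts) for j in range(k)]
        counts = [Counter() for _ in range(k)]
        for ctx, r in zip(contexts, resp):
            for f in ctx:
                for j in range(k):
                    counts[j][f] += r[j]
        log_p = []
        for j in range(k):
            total = sum(counts[j].values()) + alpha * len(vocab)
            log_p.append({f: math.log((counts[j][f] + alpha) / total)
                          for f in vocab})
        # E-step: posterior over senses for each context, in log space.
        resp = []
        for ctx in contexts:
            scores = [math.log(max(prior[j], 1e-12))
                      + sum(log_p[j][f] for f in ctx) for j in range(k)]
            m = max(scores)
            exp = [math.exp(s - m) for s in scores]
            z = sum(exp)
            resp.append([e / z for e in exp])
    # Hard sense label for each context: the most probable sense.
    return [max(range(k), key=lambda j: r[j]) for r in resp]

# Toy usage: four contexts of an adjective, two assumed seed features.
toy_contexts = [["water", "deep", "river"], ["river", "deep", "lake"],
                ["feeling", "deep", "love"], ["love", "deep", "friendship"]]
toy_labels = em_sense_discrimination(toy_contexts, [{"river"}, {"love"}])
```

In this sketch the three feature types named in the abstract would all be fed in as plain feature strings. How the paper actually weights or combines them is not stated here.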
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journals (北大核心)
2009, Issue 6, pp. 19-25 (7 pages)
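Funding
National 973 Program of China (2004CB318102)
National Natural Science Foundation of China (60775031)
National Social Science Foundation of China (08BYY060)
Foundation for the Author of National Excellent Doctoral Dissertation of China (200514)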
Keywords
computer application
Chinese information processing
knowledge acquisition
word sense discrimination
feature selection
EM algorithm