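Abstract
Part-whole relations are a fundamental and important kind of semantic relation, and acquiring them automatically from text is a basic research topic in knowledge engineering. This paper proposes a graph-based method for acquiring part-whole relations from the Web. We first download a corpus from Google using part-whole relation patterns, then use coordinate-structure patterns to match pairs of part concepts in that corpus and build a graph from the pairs. A hierarchical clustering algorithm then clusters the graph automatically so that correct part concepts group together. On top of hierarchical clustering, we exploit properties of coordinate structures, characteristics of the graph, and features of the Chinese language, adjusting the edge weights with five methods: penalizing comma edges, removing low-frequency edges, rewarding loops, and up-weighting edges whose nodes share a suffix or share a prefix. This substantially improves recall without losing the high precision of hierarchical clustering.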
Automatic discovery of part-whole relations from the Web is a fundamental and important problem in knowledge engineering. This paper proposes a graph-based method for extracting part-whole relations from the Web. First, we download snippets from Google using part-whole query patterns. We then build a graph by extracting word pairs that occur in a coordinate structure in these snippets, with the co-occurring words as nodes and their frequency counts as edge weights. A hierarchical clustering method groups the correct parts together, and it is optimized by five ways of adjusting the edge weights: reducing the weight of comma edges, cutting low-frequency edges, increasing the weight of edges that lie on a loop, increasing the weight of edges whose two nodes share a suffix, and increasing the weight of edges whose two nodes share a prefix. Experimental results show that the five methods increase recall substantially.
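The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the multiplicative bonuses, the thresholds, the reading of a "loop" as a triangle, the one-character affix test, and the choice of average-linkage clustering are all assumptions made for illustration, and the toy word pairs are invented.

```python
from collections import defaultdict
from itertools import combinations

COMMAS = {",", ","}  # assumed separators that mark a "comma edge"

def build_graph(pairs):
    """Count co-occurrences of (word_a, word_b, separator) coordinate pairs."""
    freq = defaultdict(int)   # edge -> total frequency
    comma = defaultdict(int)  # edge -> occurrences joined by a comma
    for a, b, sep in pairs:
        if a == b:
            continue
        edge = frozenset((a, b))
        freq[edge] += 1
        if sep in COMMAS:
            comma[edge] += 1
    return freq, comma

def adjust_weights(freq, comma, min_freq=2, comma_penalty=0.5,
                   loop_bonus=1.5, affix_bonus=1.5):
    """Apply the five adjustments; all parameter values are hypothetical."""
    # 1. Remove low-frequency edges.
    kept = {e: n for e, n in freq.items() if n >= min_freq}
    neighbours = defaultdict(set)
    for e in kept:
        a, b = tuple(e)
        neighbours[a].add(b)
        neighbours[b].add(a)
    weights = {}
    for e, n in kept.items():
        a, b = tuple(e)
        # 2. Penalize comma edges: each comma occurrence counts for less.
        w = n - comma_penalty * comma.get(e, 0)
        # 3. Reward loops: here, an edge whose endpoints share a neighbour.
        if neighbours[a] & neighbours[b]:
            w *= loop_bonus
        # 4. Up-weight a shared suffix (last character, an assumption).
        if a[-1] == b[-1]:
            w *= affix_bonus
        # 5. Up-weight a shared prefix (first character, an assumption).
        if a[0] == b[0]:
            w *= affix_bonus
        weights[e] = w
    return weights

def cluster(weights, threshold=1.5):
    """Average-linkage agglomerative clustering on edge weights."""
    clusters = [{n} for n in sorted({n for e in weights for n in e})]

    def similarity(c1, c2):
        total = sum(weights.get(frozenset((x, y)), 0.0)
                    for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, i, j = max((similarity(clusters[i], clusters[j]), i, j)
                         for i, j in combinations(range(len(clusters)), 2))
        if best < threshold:
            break
        clusters[i] |= clusters[j]  # merge the most similar pair (i < j)
        del clusters[j]
    return clusters

# Invented toy data: candidate parts of a car plus some noise pairs.
toy_pairs = [
    ("车门", "车窗", "、"), ("车门", "车窗", "、"),
    ("车窗", "轮胎", "、"), ("车窗", "轮胎", "、"),
    ("车门", "轮胎", "、"), ("车门", "轮胎", ","),
    ("轮胎", "价格", ","),
    ("天气", "新闻", ","), ("天气", "新闻", ","),
]
toy_freq, toy_comma = build_graph(toy_pairs)
toy_weights = adjust_weights(toy_freq, toy_comma)
toy_clusters = cluster(toy_weights)
```

In this toy run the three car parts form a triangle and reinforce one another, the single-occurrence pair is cut before clustering, and the pair seen only around commas stays below the merge threshold. How the paper actually defines loops, affixes, and the size of each adjustment is not stated in the abstract.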
Source
《中文信息学报》
CSCD
PKU Core Journal (北大核心)
2015, Issue 1, pp. 88-96 (9 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (91224006, 61173063, 61035004, 61203284, 309737163)
National Social Science Fund of China (10AYY003)
Keywords
part-whole relations
graph model
coordinate structure
hierarchical clustering
edge weight