Abstract
Designing relation extraction systems that do not depend on predefined relation types is an active research topic. Without domain-specific, machine-readable knowledge as guidance, however, relation extraction from natural-language text rarely achieves satisfactory precision and recall; constraints can effectively assist the extraction of semantic relations. This paper describes a semi-supervised machine learning framework for extracting entity-attribute-value relations. In the semi-supervised learning task, seeds are obtained mainly from Wikipedia infoboxes (information tables). A linear classifier first identifies a set of strong negative examples; the classifier is then iteratively retrained on the negative examples found so far and applied to the remaining unlabeled data to find more negative examples. Semi-supervised learning yields a set of candidate relation instances. The paper then addresses the verification of relation types: for noisy patterns, it gives a confidence measure for relation patterns; for conflicting patterns, it proposes a match-order control algorithm in which patterns with higher confidence are matched first. After these two algorithms are applied, the descriptions of relation types still show some diversity, so an agglomerative hierarchical clustering algorithm is proposed. The algorithm represents the structural features of a Wikipedia description as the vector {DW, CW, IW, BW}, defines a relatedness measure between two relation patterns, and uses it to cluster relation types. Finally, experiments on the Wikipedia XML data set show that determining relation types dynamically from the structural features of Wikipedia reduces the dependence on predefined types and improves the portability of the relation recognition system.
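The iterative search for negative examples described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which linear classifier is used, so a simple centroid-difference classifier stands in for it; the first round treats all unlabeled examples as provisional negatives and keeps only the lowest-scoring fraction as "strong" negatives; and the names `train_linear`, `mine_negatives`, and `strong_fraction` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Model = Tuple[List[float], float]  # (weights, bias)


def centroid(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_linear(pos: List[Vector], neg: List[Vector]) -> Model:
    """Centroid-difference linear classifier (an assumed stand-in)."""
    cp, cn = centroid(pos), centroid(neg)
    w = [p - q for p, q in zip(cp, cn)]
    # Put the decision boundary at the midpoint between the two centroids.
    mid = [(p + q) / 2 for p, q in zip(cp, cn)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def score(model: Model, x: Vector) -> float:
    """Signed score: negative values fall on the negative-example side."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mine_negatives(
    seeds: List[Vector],
    unlabeled: List[Vector],
    strong_fraction: float = 0.25,  # assumed cut-off for "strong" negatives
    max_rounds: int = 10,
) -> Tuple[List[Vector], List[Vector]]:
    """Return (negatives, remaining) after iterative negative mining."""
    if not seeds or not unlabeled:
        return [], list(unlabeled)
    # Round 0: train seeds vs. all unlabeled data and keep only the
    # lowest-scoring fraction as strong negatives.
    model = train_linear(seeds, unlabeled)
    ranked = sorted(unlabeled, key=lambda x: score(model, x))
    k = max(1, int(len(ranked) * strong_fraction))
    negatives, remaining = ranked[:k], ranked[k:]
    # Later rounds: retrain on the negatives found so far and apply the
    # classifier to the remaining unlabeled data until nothing new is found.
    for _ in range(max_rounds):
        if not remaining:
            break
        model = train_linear(seeds, negatives)
        new_neg = [x for x in remaining if score(model, x) < 0]
        if not new_neg:
            break
        negatives = negatives + new_neg
        remaining = [x for x in remaining if score(model, x) >= 0]
    return negatives, remaining
```

Calling `mine_negatives(seeds, unlabeled)` with numeric feature vectors returns the accumulated negatives together with the examples that were never rejected; in the framework's terms, the latter would correspond to candidate relation instances. A margin-based classifier could be substituted for `train_linear` without changing the loop.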
Source
《南京大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2012, Issue 4, pp. 466-474 (9 pages)
Journal of Nanjing University (Natural Science)
Funding
National Natural Science Foundation of China (60873069)
Jiangsu Province Graduate Innovation Project (CX99B204)
Keywords
relation extraction, semi-supervised learning, Wikipedia, entity-attribute-value