A study of relation extraction of undefined relation type based on semi-supervised learning framework
Cited by: 7
Abstract: Designing relation extraction systems for undefined relation types is an active research topic. Without domain-specific, machine-readable knowledge as guidance, however, relation extraction from natural language text rarely achieves satisfactory precision and recall, and constraints can effectively assist the extraction of semantic relations. This paper describes a semi-supervised machine learning framework for extracting "entity-attribute-value" relations. In the semi-supervised learning task, seeds are obtained mainly from Wikipedia's information tables (infoboxes). A linear classifier is first used to find some strong negative examples; the classifier is then iteratively retrained on the negative examples found so far and applied to the remaining unlabeled data to find more negative examples. Semi-supervised learning yields a set of candidate relation instances. The paper then discusses the problem of verifying relation types. For noisy patterns, it gives a confidence measure for relation patterns; for conflicting patterns, it proposes an algorithm that controls matching order, on the principle that higher-confidence patterns are matched first. After these two algorithms, the descriptions of relation types still show some diversity, so an agglomerative hierarchical clustering algorithm is proposed. This algorithm represents the structural features of Wikipedia descriptions as a vector {DW, CW, IW, BW} and defines a relatedness measure between two relation patterns, which is used to cluster relation types. Finally, experiments on the Wikipedia XML dataset show that determining relation types dynamically from the structural features of Wikipedia reduces the dependence on predefined types and improves the portability of the relation recognition system.
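The iterative negative-example step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the linear classifier, the features, or the stopping rule, so everything below is an assumption made for illustration: a centroid-based (Rocchio-style) linear scorer stands in for the classifier, `strong_margin` is a hypothetical threshold for what counts as a "strong" negative, and the two-dimensional vectors are toy data rather than features from the paper.

```python
from typing import List, Tuple

Vector = List[float]
Model = Tuple[Vector, float]  # (weights, bias)


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_linear(positives: List[Vector], negatives: List[Vector]) -> Model:
    """Assumed stand-in classifier: the hyperplane halfway between the two centroids."""
    cp, cn = centroid(positives), centroid(negatives)
    weights = [p - n for p, n in zip(cp, cn)]
    midpoint = [(p + n) / 2 for p, n in zip(cp, cn)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def score(model: Model, x: Vector) -> float:
    """Signed score: positive on the seed side, negative on the other side."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mine_negatives(seeds: List[Vector], unlabeled: List[Vector],
                   strong_margin: float = 0.5, max_rounds: int = 10):
    """Return (candidates, negatives) after iterative negative mining."""
    # Step 1: treat all unlabeled data as provisional negatives and keep only
    # the items that fall far on the negative side as strong negatives.
    model = train_linear(seeds, unlabeled)
    negatives = [x for x in unlabeled if score(model, x) < -strong_margin]
    remaining = [x for x in unlabeled if score(model, x) >= -strong_margin]

    # Step 2: retrain on seeds vs. known negatives, apply the classifier to
    # the remaining unlabeled data, and stop when no new negatives appear.
    for _ in range(max_rounds):
        if not negatives or not remaining:
            break
        model = train_linear(seeds, negatives)
        newly_negative = [x for x in remaining if score(model, x) < 0]
        if not newly_negative:
            break
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if score(model, x) >= 0]
    return remaining, negatives


# Toy data (hypothetical): seeds lie near (1, 0), clear negatives near (0, 1).
seeds = [[1.0, 0.0], [0.9, 0.1]]
unlabeled = [[0.95, 0.05], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.0, 1.0]]
candidates, negatives = mine_negatives(seeds, unlabeled)
print(candidates)  # [[0.95, 0.05], [0.8, 0.2]]
print(negatives)   # [[0.1, 0.9], [0.0, 1.0], [0.4, 0.6]]
```

In this toy run the first pass flags only the two points nearest (0, 1) as strong negatives; the borderline point (0.4, 0.6) is picked up in the next round, after the classifier has been retrained on those negatives. The items that survive play the role of the candidate relation instances mentioned in the abstract.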
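Authors: 程显毅 (Cheng Xianyi), 朱倩 (Zhu Qian)
Source: Journal of Nanjing University (Natural Science) (《南京大学学报(自然科学版)》), 2012, No. 4, pp. 466-474 (9 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (60873069); Jiangsu Province Graduate Innovation Project (CX99B204)
Keywords: relation extraction; semi-supervised learning; Wikipedia; entity-attribute-value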
