Abstract
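Traditional information extraction targets specific domains. Moving to a new domain requires manually writing new extraction rules and manually labeling new training samples. Open information extraction overcomes these limitations. Most existing open information extraction systems target English; research on Chinese is comparatively scarce and focuses mainly on extracting triples, with no method for extracting n-ary tuples from Chinese text. A Chinese open n-ary entity relation extraction method based on dependency parsing is therefore proposed. First, the text collection is preprocessed and dependency-parsed. Next, each verb is treated as a candidate relation word, and each base noun phrase connected to that verb by a valid dependency path that satisfies the conditions is treated as an entity word; a relation word that links two or more entity words forms, together with those entity words, a candidate n-ary entity relation group. Finally, a trained logistic regression classifier filters the n-ary entity relation groups. Extraction results on a Baidu Baike dataset show that the proposed method reaches an accuracy of up to 81% when extracting a large number of n-ary entity relation tuples.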
Traditionally, information extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora. Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. Open information extraction (OIE) overcomes the limitations of traditional IE techniques, which train an individual extractor for every relation type. OIE for English has attracted much attention; however, few studies have been reported on OIE for Chinese. This paper presents an N-ary Chinese OIE system (N-COIE). N-COIE preprocesses the sentences using natural language processing tools and then extracts entity-relation groups from the preprocessed sentences. Finally, N-COIE filters the entity-relation groups using a trained logistic regression classifier. Empirical results show the effectiveness of the proposed system.
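The candidate-generation step described in the abstract (verbs as candidate relation words, noun phrases reachable over a valid dependency path as entity words, and a group kept only when it links two or more entities) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the `Token` structure, the hand-written parse, the LTP-style labels (`SBV`, `VOB`, `IOB`, `FOB`, `ADV`, `POB`), the particular set of paths treated as valid, and the use of single nouns in place of base noun phrases. The logistic regression filtering step is not shown.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    text: str   # surface form
    pos: str    # coarse POS tag: "n" noun, "v" verb, "p" preposition, "u" particle
    head: int   # index of the head token, 0 for the root
    rel: str    # dependency label on the arc to the head


# Assumption: arcs treated as "valid" direct paths from a verb to an entity.
# The paper's actual path conditions are not given in the abstract.
DIRECT_RELS = {"SBV", "VOB", "IOB", "FOB"}


def entities_for(verb, tokens):
    """Collect nouns linked to `verb` directly or through a preposition."""
    found = []
    for tok in tokens:
        if tok.pos != "n":
            continue
        if tok.head == verb.idx and tok.rel in DIRECT_RELS:
            # Direct path: noun --SBV/VOB/...--> verb
            found.append(tok)
        elif tok.rel == "POB" and tok.head > 0:
            # Two-step path: noun --POB--> preposition --ADV--> verb
            prep = tokens[tok.head - 1]
            if prep.head == verb.idx and prep.rel == "ADV":
                found.append(tok)
    return found


def candidate_groups(tokens):
    """Return (relation word, entity words) for verbs linking >= 2 entities."""
    groups = []
    for tok in tokens:
        if tok.pos != "v":
            continue
        ents = entities_for(tok, tokens)
        if len(ents) >= 2:
            groups.append((tok.text, tuple(e.text for e in ents)))
    return groups


# Hand-written parse of "李明 在 北京 创办 了 公司"
# ("Li Ming founded a company in Beijing"); a real system would obtain
# this from a segmenter, POS tagger, and dependency parser.
SENTENCE = [
    Token(1, "李明", "n", 4, "SBV"),
    Token(2, "在", "p", 4, "ADV"),
    Token(3, "北京", "n", 2, "POB"),
    Token(4, "创办", "v", 0, "HED"),
    Token(5, "了", "u", 4, "RAD"),
    Token(6, "公司", "n", 4, "VOB"),
]

if __name__ == "__main__":
    for relation, entities in candidate_groups(SENTENCE):
        print(relation, entities)
```

In this sketch the verb 创办 links three entity words, so the candidate is a group with three arguments rather than a triple of subject, relation, and object. In the method described above, such candidates would then be passed to the trained classifier for filtering.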
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2017, Issue S1, pp. 80-83 (4 pages)
Computer Science
Funding
Funded by the project "Research on Chinese Discourse Anaphora Resolution Strategies Based on Frame Semantic Annotation" (2012011011-2)