Abstract
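Information extraction research has progressed from traditional tasks restricted to fixed categories and fixed domains to open-category, open-domain information extraction. The techniques have likewise evolved from statistical methods based on manually annotated corpora to open information extraction that effectively mines and integrates multi-source, heterogeneous Web knowledge and combines it with statistical methods. After reviewing the history of text information extraction research, this paper focuses on the tasks, difficulties, methods, evaluations, state of the art, and open problems of open entity extraction, entity disambiguation, and relation extraction. Drawing on our research group's accumulated work, it also analyzes and discusses the future directions of text information extraction and its applications in Web knowledge engineering and question answering systems.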
Research on information extraction is moving toward open information extraction, i.e., extracting open categories of entities, relations, and events from open-domain text resources. The methods have likewise shifted from purely statistical machine learning models trained on human-annotated corpora to statistical learning models that incorporate knowledge bases mined from large-scale, heterogeneous Web resources. This paper first reviews the history of information extraction research, then introduces in detail the task definitions, difficulties, typical methods, evaluations, performance, and challenges of three main open-domain information extraction tasks: entity extraction, entity disambiguation, and relation extraction. Finally, drawing on our own research in this field, we analyze and discuss the development directions of open information extraction and its applications in large-scale knowledge engineering, question answering, etc.
Source
《中文信息学报》
CSCD
Peking University Core Journal
2011, No. 6, pp. 98-110 (13 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (Grant Nos. 60875041, 61070106)
Keywords
开放式信息抽取
知识工程
文本理解
open information extraction
knowledge engineering
text understanding