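Abstract
Structuring unstructured data is a new challenge for management information systems in the big data environment. This paper studies the characteristics of free text from the perspective of writing style and proposes a method for extracting the attributes of emergencies from Web news. The method first analyzes the features of Web text and of the news genre, then uses Google Word2Vec to expand a vocabulary built by domain experts, and applies a different extraction strategy to each attribute: the vocabulary for event classification; stylistic features for time and event summary; and stylistic features combined with the vocabulary for location, casualties, and economic loss. Experiments show that, when extracting emergency attributes from a crawled Web news corpus and a public corpus, the style-and-vocabulary method achieves an average accuracy of 87.89% and 91.29% and an average recall of 81.76% and 87.91%, respectively, which meets the needs of emergency management.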
With the development of big data, management information systems increasingly need to structure large volumes of unstructured and semi-structured data. This paper proposes a method for extracting the attributes of emergencies from Web pages. Based on a study of Web page structure and news writing style, the paper expands an existing terminology with Google Word2Vec and applies different methods to different attributes of emergencies: terminology for classification; style for date/time and abstract; style and terminology for location, casualties, and loss. Experimental results show that the method achieves an average accuracy of 87.89% and 91.29% and an average recall of 81.76% and 87.91% on a Web news set and a published emergency corpus, respectively, which is high enough to meet the requirements of emergency management. The approach to information extraction proposed in this paper has practical value for free-text information extraction in other application fields.
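As a rough illustration of two of the strategies named in the abstract (terminology for classification, style for date/time), the following minimal Python sketch may help. It is not the authors' implementation: the `LEXICON` entries, the `DATE_PATTERN` regular expression, the function names, and the assumption that the time appears in the lead sentence of a news report are all hypothetical, and English text is used only for readability, whereas the paper works on Chinese news.

```python
import re

# Hypothetical mini-lexicon (event category -> trigger terms).
# The paper's expert-built, Word2Vec-expanded terminology is not
# reproduced in the abstract, so these entries are placeholders.
LEXICON = {
    "earthquake": {"earthquake", "aftershock", "magnitude"},
    "fire": {"fire", "blaze", "burned"},
    "flood": {"flood", "flooding", "inundated"},
}

# Assumed stylistic cue: a news report usually states the time of the
# event in its lead (first) sentence. Only "Month D, YYYY" is matched.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def classify_event(text):
    """Return the category whose lexicon terms occur most often, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(token in terms for token in tokens)
        for category, terms in LEXICON.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract_time(text):
    """Return the first date in the lead sentence, else in the full text."""
    lead = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]
    match = DATE_PATTERN.search(lead) or DATE_PATTERN.search(text)
    return match.group(0) if match else None


news = (
    "A magnitude 6.1 earthquake struck the county on August 8, 2017. "
    "Rescue teams were sent to the area overnight."
)
print(classify_event(news), "|", extract_time(news))
```

In this sketch the classifier relies only on term counts and the time extractor only on sentence position, which mirrors the division of labor described in the abstract; attributes such as location or casualties would, per the abstract, combine both kinds of evidence.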
Authors
邱奇志
周三三
刘长发
陈晖
QIU Qizhi; ZHOU Sansan; LIU Changfa; CHEN Hui (School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei 430000, China)
Source
《中文信息学报》
CSCD
PKU Core Journal
2018, Issue 9, pp. 56-65, 74 (11 pages in total)
Journal of Chinese Information Processing
Funding
Open Project of the Hubei Collaborative Innovation Center for Early Warning and Emergency Response Technology (JD20150507)
Keywords
style (文体)
terminology (词表)
information extraction (信息抽取)
emergency (突发事件)