Abstract
To meet the need for mining earthquake news on the Web, news texts are collected with a Web crawler as the research corpus, and an improved TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to train on the corpus; the feature words with the largest weights are selected to preliminarily identify earthquake-related documents. Four elements, each built from feature words, are used to describe a seismic event, yielding a knowledge framework for seismic events. Candidate event sentences are obtained from the earthquake-related documents by matching the framework's element feature words, and syntactic analysis of these sentences is used to summarize the forms and patterns in which seismic elements appear. On this basis, extraction rules are constructed, an extraction algorithm is implemented, and seismic event identification and extraction experiments are carried out. Finally, the extraction accuracy is analyzed and evaluated; the results indicate that the method identifies and extracts seismic events with high precision and is a promising approach to thematic event mining on the Web.
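The feature-word selection step can be illustrated with a minimal sketch. The abstract does not specify how the authors' improved TF-IDF differs from the standard form, so the code below uses plain TF-IDF (term frequency times the log of inverse document frequency) as a stand-in. The function names `tfidf_weights` and `top_feature_words` are hypothetical, and the input is assumed to be already tokenized documents.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Standard TF-IDF only; the paper's improved variant is not
    described in the abstract and is NOT reproduced here.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_feature_words(docs, k):
    """Rank terms by summed TF-IDF weight over the corpus; keep the top k."""
    summed = Counter()
    for doc_weights in tfidf_weights(docs):
        summed.update(doc_weights)  # adds the weights term by term
    return [term for term, _ in summed.most_common(k)]
```

Calling `top_feature_words(docs, k)` on a tokenized corpus returns the `k` highest-weighted terms, which could then serve as candidate feature words for flagging earthquake-related documents. Summing weights across documents is one simple ranking choice made for this sketch, not necessarily the one used in the paper.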
Source
Engineering Journal of Wuhan University (《武汉大学学报(工学版)》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2018, Issue 2, pp. 183-188 (6 pages)
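Funding
National Natural Science Foundation of China (Grant No. 41471323)
Special research fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing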
Keywords
Web earthquake news
information mining
event framework
text analysis (syntactic analysis)