Abstract
Event detection in news streams is an important research area in the topic detection and tracking community, but most existing event detection methods are offline and inaccurate. To address these shortcomings in timeliness and accuracy, an online event detection method for news streams is proposed. The occurrence of an event is usually accompanied by a marked rise, within the corresponding time span, in the frequency of the features (i.e., keywords) that make up the event; such features are called bursty features. A goodness-of-fit test is applied to detect whether the frequency distribution of each feature in the news reports of a given time span has changed significantly, and a left-sided test is then used to confirm all bursty features in that time span. The correlations among bursty features are analyzed, and an evolutionary spectral clustering algorithm groups highly correlated bursty features into events. Experiments on the Reuters Corpus Volume 1 show that the method effectively identifies bursty features and detects events as they occur, and that the detected events agree closely with the corresponding real-world events.
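The two-stage test for bursty features can be illustrated with a minimal sketch. The abstract does not give the exact statistics, so everything below is an assumption made for illustration: a two-category Pearson chi-square goodness-of-fit test (documents that contain the feature versus documents that do not) against a baseline rate, followed by a one-sided z-test whose statistic is defined as baseline minus observed proportion, so that a rise in frequency falls in the left tail. The function names, the significance level `alpha`, and the toy counts are hypothetical and do not come from the paper.

```python
import math


def chi_square_gof(hits, total, p0):
    """Pearson chi-square goodness-of-fit test with two categories.

    Compares the number of documents containing a feature (hits) out of
    `total` documents with the count expected under baseline rate p0.
    Assumes 0 < p0 < 1. Returns (statistic, p_value) for 1 degree of freedom.
    """
    expected_hit = total * p0
    expected_miss = total * (1.0 - p0)
    stat = ((hits - expected_hit) ** 2 / expected_hit
            + ((total - hits) - expected_miss) ** 2 / expected_miss)
    # Chi-square survival function for 1 d.o.f.: P(X > stat) = erfc(sqrt(stat / 2))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def left_tail_p(hits, total, p0):
    """One-sided z-test (assumed form): z = (p0 - observed) / se.

    A rise in frequency makes z negative, so a small left-tail
    probability indicates a burst rather than a drop.
    """
    observed = hits / total
    se = math.sqrt(p0 * (1.0 - p0) / total)
    z = (p0 - observed) / se
    # Standard normal CDF evaluated at z
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bursty_features(window_counts, window_docs, baseline_rates, alpha=0.01):
    """Return features whose distribution changed AND whose frequency rose."""
    bursty = []
    for feature, hits in window_counts.items():
        p0 = baseline_rates[feature]
        _, p_gof = chi_square_gof(hits, window_docs, p0)
        if p_gof < alpha and left_tail_p(hits, window_docs, p0) < alpha:
            bursty.append(feature)
    return sorted(bursty)


# Toy data (made up for illustration): 1000 documents in the current window.
baseline = {"earthquake": 0.01, "market": 0.20, "election": 0.05}
counts = {"earthquake": 60, "market": 205, "election": 20}
print(bursty_features(counts, 1000, baseline))
```

In this toy window, "election" changes its distribution significantly but in the downward direction, so the one-sided test rejects it. That is the role the abstract assigns to the second test: confirming which of the changed features are actually bursty.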
Source
Journal of Zhejiang University (Engineering Science)
EI
CAS
CSCD
Peking University Core Journals
2011, Issue 6, pp. 1006-1012 (7 pages)
Funding
Supported by the National Key Technology R&D Program of China (2008BAH26B00)
Keywords
online event detection
evolutionary spectral clustering
hypothesis test
news stream