Abstract
This paper presents a system for detecting hot events on the web. The system works on the stream of online news reports: it automatically detects the hot events in any given period and plots a curve describing how each event develops over time. Because web news corpora are large and strongly time-dependent, the system first groups the corpus by day and applies agglomerative clustering to each day's reports to obtain micro-clusters. It then selects all micro-clusters that fall within the period of interest and applies Single-pass clustering to them to produce a list of events. Finally, the candidate events are ranked by hotness using an event hotness formula. Experiments on a 2007 news corpus show that the system achieves good results.
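The second stage described above, merging per-day micro-clusters into events with Single-pass clustering and then ranking the events, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the similarity measure, the threshold, or the hotness formula, so the cosine similarity, the `threshold` value, and the `hotness` function (total report count) are all assumptions. The micro-clusters are assumed to have been produced already by the per-day agglomerative step, each as a hypothetical `(day, term_vector, n_reports)` tuple.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(micro_clusters, threshold=0.5):
    """Merge micro-clusters into events in one pass, in the order given.

    micro_clusters: iterable of (day, term_vector, n_reports) tuples
    (a hypothetical representation). threshold is an assumed value.
    """
    events = []
    for day, vec, n_reports in micro_clusters:
        best, best_sim = None, 0.0
        for event in events:
            sim = cosine(vec, event["centroid"])
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: absorb the micro-cluster into the event.
            best["centroid"].update(vec)
            best["daily"][day] += n_reports
        else:
            # Otherwise the micro-cluster starts a new event.
            events.append({"centroid": Counter(vec),
                           "daily": Counter({day: n_reports})})
    return events


def hotness(event):
    """Placeholder hotness: total number of reports (an assumption)."""
    return sum(event["daily"].values())


def rank_events(events):
    """Sort candidate events from hottest to coldest."""
    return sorted(events, key=hotness, reverse=True)
```

The per-day counts kept in `daily` are what a development curve for an event could be plotted from; any real hotness formula would replace the placeholder in `hotness`.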
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal
2008, No. 6, pp. 80-85 (6 pages)
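Funding
National Natural Science Foundation of China (60773167)
Natural Science Foundation of Hubei Province (2006ABC011)
National Basic Research Program of China (973 Program) (2007CB310804)
Programme of Introducing Talents of Discipline to Universities, Ministry of Education / State Administration of Foreign Experts Affairs (B07042)
National Key Technology R&D Program of the 11th Five-Year Plan (2006BAK11B03)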
Keywords
computer application
Chinese information processing
event detection
agglomerative clustering
Single-pass clustering
hotness calculation