Abstract
Topic Detection and Tracking (TDT) is an evaluation-driven research area that aims to organize and exploit streams of language text according to events. Since it was proposed in 1996, it has attracted increasingly wide attention. Building on a study of existing mature algorithms, this paper proposes a topic detection algorithm based on division and multi-level clustering. Its core idea is to split all the data into groups whose members are related to one another, cluster each group separately to obtain the topics within that group (micro-clusters), and then cluster all the micro-clusters to obtain the final topics. Several optimization strategies are applied during clustering to safeguard clustering quality. A system based on the algorithm was tested on the TDT4 Chinese corpus, and the results indicate that it is among the best-performing algorithms currently available.
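The two-level idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, since the abstract does not give these details: the grouping rule (fixed-size consecutive chunks of the stream), the bag-of-words vectors, cosine similarity, the greedy centroid-linkage merging, the threshold values, and the names `cosine`, `agglomerate`, and `detect_topics`. The paper's optimization strategies are not modeled.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def agglomerate(items, threshold):
    """Greedy agglomerative clustering (hypothetical helper).

    items: list of (term_vector, member_ids). The most similar pair is
    merged repeatedly until no pair reaches `threshold`. A cluster is
    represented by the sum of its members' vectors, which has the same
    direction as the centroid, so cosine similarity is unaffected.
    """
    clusters = [(Counter(vec), list(ids)) for vec, ids in items]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][0], clusters[j][0])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in pair]
        clusters.append(merged)
    return clusters


def detect_topics(docs, group_size=3, micro_threshold=0.4, topic_threshold=0.3):
    """Divide, cluster each group into micro-clusters, then cluster those."""
    vectors = [Counter(d.split()) for d in docs]
    micro = []
    # Division step (assumed rule): consecutive chunks of the document stream.
    for start in range(0, len(vectors), group_size):
        end = min(start + group_size, len(vectors))
        group = [(vectors[i], [i]) for i in range(start, end)]
        micro.extend(agglomerate(group, micro_threshold))
    # Second level: cluster the micro-clusters into final topics.
    topics = agglomerate(micro, topic_threshold)
    return sorted(sorted(ids) for _, ids in topics)


docs = [
    "earthquake city damage rescue",
    "earthquake rescue teams city",
    "election vote candidate poll",
    "election candidate debate vote",
    "earthquake damage aid city",
    "poll vote election result",
]
topics = detect_topics(docs)
print(topics)  # [[0, 1, 4], [2, 3, 5]]
```

In this toy run, each chunk of three documents first yields two micro-clusters, and the second pass joins the earthquake and election micro-clusters across chunks. Clustering small groups first keeps each pairwise comparison step small, which is one plausible motivation for the divide-then-merge design.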
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journals (北大核心)
2006, No. 1, pp. 29-36 (8 pages)
Funding
National 973 Program of China (grant 2004CB318109)
Keywords
computer application
Chinese information processing
topic detection and tracking
division and multi-level clustering
hierarchical clustering