Abstract
Topic detection and tracking (TDT) has been widely studied in recent years. However, existing research is mainly based on conventional news stories, and extending it to short reports remains problematic. This paper proposes a topic recognition method based on the Dirichlet process mixture model (DPMM), which handles topic segmentation and TDT in a unified model. We introduce the application of DPMM to topic recognition and discuss two schemes designed specifically to address the sparseness problem of short texts. The first is an algorithmic workflow that changes the processing unit of topic recognition from a single short text to a session. The second extends the DPMM so that word dependency is taken into account when estimating the word distributions associated with each known topic. Topics in spontaneous text streams are then identified by performing topic segmentation and TDT simultaneously. The advantage of DPMM is that the number of mixture components need not be fixed in advance and no prior knowledge of the number or content of topics is required, which makes it better suited to topic recognition in streaming text. Experimental results show that DPMM is effective for topic recognition on short-text data.
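To illustrate why a Dirichlet process mixture does not need the number of topics in advance, the following is a minimal sketch, not the authors' algorithm. It assumes a greedy single-pass assignment (rather than proper posterior inference such as Gibbs sampling), a smoothed unigram likelihood, and sessions given as lists of words. The function names and the parameters `alpha`, `beta`, and `vocab_size` are hypothetical and chosen only for illustration; the word-dependency extension described in the abstract is not modeled.

```python
import math
from collections import Counter


def log_likelihood(words, counts, total, beta, vocab_size):
    # Smoothed unigram log-likelihood of a session under one topic.
    # Simplification: topic counts are not updated word by word.
    return sum(
        math.log((counts[w] + beta) / (total + beta * vocab_size))
        for w in words
    )


def assign_session(words, topics, alpha, beta, vocab_size):
    """Return the index of the best topic; len(topics) means 'open a new topic'."""
    scores = []
    for counts, n_sessions in topics:
        total = sum(counts.values())
        # Chinese-restaurant-process prior: proportional to topic size.
        prior = math.log(n_sessions)
        scores.append(prior + log_likelihood(words, counts, total, beta, vocab_size))
    # A new topic has prior weight alpha and an empty count table.
    scores.append(math.log(alpha) + log_likelihood(words, Counter(), 0, beta, vocab_size))
    return max(range(len(scores)), key=scores.__getitem__)


def cluster_stream(sessions, alpha=1.0, beta=0.1, vocab_size=1000):
    """Greedily assign each incoming session to an existing or a new topic."""
    topics = []  # each entry: [word Counter, number of sessions assigned]
    labels = []
    for words in sessions:
        k = assign_session(words, topics, alpha, beta, vocab_size)
        if k == len(topics):
            topics.append([Counter(), 0])  # the topic set grows on demand
        topics[k][0].update(words)
        topics[k][1] += 1
        labels.append(k)
    return labels


if __name__ == "__main__":
    toy_sessions = [
        ["goal", "match", "team"],
        ["team", "goal", "score"],
        ["stock", "market", "price"],
        ["market", "stock", "trade"],
    ]
    print(cluster_stream(toy_sessions, alpha=1.0, beta=0.1, vocab_size=10))
```

In this sketch a new topic is opened whenever its score exceeds that of every existing topic, so the topic set grows with the stream instead of being fixed beforehand. The parameter `alpha` controls how readily new topics are created.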
Source
《计算机应用与软件》
CSCD
Peking University Core Journals (北大核心)
2014, Issue 8, pp. 191-195 (5 pages)
Computer Applications and Software