Abstract
Mining the conversations hidden in the large volume of text messages stored on a mobile phone is a challenging problem, because text messages lack the metadata, such as "subject" and "reply-to", that is commonly used in e-mail thread analysis. This paper proposes a conversation recognition model for text messages based on a temporal clustering algorithm and topic detection. First, according to the temporal distribution of the message stream, all messages exchanged between the two parties are partitioned into candidate conversations. The candidate conversations are then analyzed in more depth with a semantic topic model trained using latent Dirichlet allocation (LDA), which is used to measure the topical relevance between candidate conversations. Finally, candidate conversations are merged on the basis of their combined temporal and topical relevance to identify the latent conversations. Experiments on a real dataset of 122 359 text messages produced by 50 university students over 6 months demonstrate the effectiveness of the algorithm.
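The pipeline summarized in the abstract (temporal segmentation into candidate conversations, then merging by combined temporal and topical relevance) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual clustering algorithm, relevance formula, or thresholds, so everything here is an assumption made for illustration: the fixed gap threshold `max_gap`, the linear time score over `merge_window`, the cosine similarity over averaged topic vectors, the weight `alpha`, and the cutoff `min_score`. Per-message topic distributions are taken as given; in practice they would come from a trained LDA model.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt


@dataclass
class Message:
    timestamp: float      # seconds since some epoch
    topics: list[float]   # topic distribution, assumed to come from an LDA model


def split_by_gap(messages, max_gap):
    """Group time-sorted messages; a gap longer than max_gap starts a new candidate."""
    candidates = []
    for msg in sorted(messages, key=lambda m: m.timestamp):
        if candidates and msg.timestamp - candidates[-1][-1].timestamp <= max_gap:
            candidates[-1].append(msg)
        else:
            candidates.append([msg])
    return candidates


def mean_topics(candidate):
    """Average the topic distributions of all messages in a candidate."""
    k = len(candidate[0].topics)
    return [sum(m.topics[i] for m in candidate) / len(candidate) for i in range(k)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_candidates(candidates, merge_window, min_score, alpha=0.5):
    """Merge each candidate into its predecessor when the combined score is high enough.

    The score is a hypothetical weighted sum of a time score (decaying linearly
    to 0 over merge_window) and the topical cosine similarity.
    """
    merged = []
    for cand in candidates:
        if merged:
            prev = merged[-1]
            gap = cand[0].timestamp - prev[-1].timestamp
            time_score = max(0.0, 1.0 - gap / merge_window)
            topic_score = cosine(mean_topics(prev), mean_topics(cand))
            if alpha * time_score + (1 - alpha) * topic_score >= min_score:
                prev.extend(cand)
                continue
        merged.append(list(cand))
    return merged
```

With this sketch, two bursts of messages about the same topic separated by a moderate pause would be merged into one conversation, while a later burst about a different topic would remain separate. The weighted sum is only one plausible way to integrate the two relevance measures.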
Source
《软件学报》
EI
CSCD
PKU Core (Peking University core journal list)
2012, Issue 10, pp. 2586-2599 (14 pages)
Journal of Software
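Funding
National Basic Research Program of China (973 Program) (2009CB320504)
National High Technology Research and Development Program of China (863 Program) (2011AA01A101)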