Abstract
The Internet carries a large volume of short-text message streams. Conversation extraction is needed to group messages on the same topic into the same conversation. Message content, timing, and user relationships all affect extraction performance, so this paper proposes a multi-strategy conversation extraction algorithm. First, the stream is segmented into conversation segments based on content, time, and user relationships. Then, semantic similarity between contents is computed from word vectors and combined with temporal information to obtain the relevance between conversation segments, which are clustered to complete the extraction. Experiments on three datasets drawn from real chat logs show that the method outperforms traditional methods, with the overall F-measure improving by 38.5%, 15.7%, and 26.8%, respectively.
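The second stage described above (word-vector similarity combined with time, then clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: representing a segment by the mean of its word vectors, the exponential time decay, the names `relevance` and `cluster`, the parameters `alpha`, `tau`, and `threshold`, and the greedy single-pass clustering.

```python
import math

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that appear in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def relevance(seg_a, seg_b, vectors, alpha=0.7, tau=300.0):
    """Hypothetical relevance: weighted mix of semantic and temporal closeness.

    alpha weights the semantic term; tau (seconds) controls how quickly the
    temporal term decays. Both are illustrative values, not from the paper.
    """
    va = mean_vector(seg_a["tokens"], vectors)
    vb = mean_vector(seg_b["tokens"], vectors)
    sem = cosine(va, vb) if va is not None and vb is not None else 0.0
    gap = abs(seg_b["start"] - seg_a["end"])
    return alpha * sem + (1 - alpha) * math.exp(-gap / tau)

def cluster(segments, vectors, threshold=0.6):
    """Greedy single-pass clustering (an assumed, simplified strategy).

    Each segment joins the conversation whose latest segment is most
    relevant to it, provided the score reaches threshold; otherwise it
    starts a new conversation.
    """
    conversations = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        best, best_score = None, threshold
        for conv in conversations:
            score = relevance(conv[-1], seg, vectors)
            if score >= best_score:
                best, best_score = conv, score
        if best is None:
            conversations.append([seg])
        else:
            best.append(seg)
    return conversations
```

In this sketch each segment is a dict with `tokens`, `start`, and `end` (timestamps in seconds), and `vectors` maps a word to a list of floats, for example loaded from a pretrained word-vector model. Interleaved topics end up in separate conversations because the semantic term dominates when `alpha` is large.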
Source
《计算机应用研究》
CSCD
Peking University Core Journal (北大核心)
2016, No. 4, pp. 997-1002 (6 pages)
Application Research of Computers
Funding
National "863" Program of China (2011AA7032030D)
National Social Science Foundation of China (14BXW028)
Keywords
conversation extraction
short text message
short text message stream
word vectors
chat log