Abstract
Topic detection is an important research problem for predicting trending topics on social networking platforms. Most existing topic detection algorithms measure the similarity between text data with traditional cosine similarity, which cannot distinguish data objects whose values change proportionally across all dimensions. The resulting similarity scores are inaccurate, which lowers the accuracy of topic detection. To address this problem, this paper proposes a topic detection algorithm based on bilateral cosine similarity (TABOC). First, cosine similarity is improved from two perspectives, direction and value, to give bilateral cosine similarity, which distinguishes data objects whose values change proportionally in every dimension while retaining the advantage of traditional cosine similarity in direction discrimination, and so measures text similarity more accurately. Next, the bilateral cosine feature vector of a set and the addition of bilateral cosine feature vectors are defined, together with related definitions and theorems. Irrelevant information is discarded and the feature vector of a newly merged set is computed directly, which reduces the time and space cost of topic detection. Finally, the algorithm is combined with an incremental clustering framework so that newly arriving data can be processed efficiently. Experiments on data from Baidu Tieba show that TABOC is effective and feasible for topic detection, and that its accuracy and time efficiency are, overall, better than those of the baseline algorithms compared.
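The limitation of traditional cosine similarity noted in the abstract can be illustrated with a minimal sketch. It shows only the standard cosine measure, not the bilateral cosine similarity of TABOC, whose definition is not given in this record. The function name `cosine_similarity` and the three toy term-frequency vectors are assumptions made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical term-frequency vectors (illustrative values only).
doc_short = [1.0, 2.0, 3.0]
doc_long = [10.0, 20.0, 30.0]   # every dimension of doc_short scaled by 10
doc_other = [3.0, 2.0, 1.0]     # same magnitude as doc_short, different direction

# Proportional vectors point in the same direction, so the score is 1
# (up to floating-point error) even though the values differ tenfold.
sim_scaled = cosine_similarity(doc_short, doc_long)

# A change of direction does lower the score.
sim_other = cosine_similarity(doc_short, doc_other)

print(sim_scaled, sim_other)
```

Because the score depends only on the angle between the vectors, a measure that also accounts for magnitude is needed to separate `doc_short` from `doc_long`, which is the gap the abstract says bilateral cosine similarity is designed to close.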
Authors
WU Sen (武森)
GAO Xiao-nan (高晓楠)
HE Hui-xia (何慧霞)
School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Source
Operations Research and Management Science (《运筹与管理》)
CSSCI
CSCD
Peking University Core Journals (北大核心)
2021, No. 2, pp. 75-83 (9 pages)
Funding
Supported by the National Natural Science Foundation of China (71271027, 71971025).
Keywords
social networking platforms
topic detection
bilateral cosine similarity
feature vector
incremental clustering