Abstract
[Research purpose] To address the data sparsity and high dimensionality that affect mainstream topic discovery models, this paper proposes BiLSTM-HBBTM, a microblog hot topic discovery method that improves on the bursty biterm topic model (BBTM), with the aim of achieving better results in microblog hot topic mining. [Research method] First, microblog propagation value, term H-index, and biterm bursty probability are introduced to perform feature selection at both the document level and the word level, which mitigates data sparsity and high dimensionality. Second, a bidirectional long short-term memory network (BiLSTM) is trained to capture the relationships between words, and the result is combined with the inverse document frequency of words to form prior knowledge for biterms, so that relationships between words are no longer ignored. Third, a density-based method adaptively selects the optimal number of topics for BBTM, which removes the need to specify the number of topics manually, as traditional topic models require. Finally, the method is evaluated on real microblog datasets in terms of hot topic discovery accuracy, topic quality, and consistency. [Research conclusion] Experiments show that BiLSTM-HBBTM outperforms the comparison models on multiple evaluation metrics, which supports the effectiveness and feasibility of the proposed model.
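To make the notion of biterm bursty probability more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the commonly cited BBTM-style estimate, in which a biterm's burst probability in the current time slice is the excess of its count over its mean count in earlier slices, floored at a small constant and divided by the current count. The function names, the `eps` default, and the toy data are all hypothetical, and the abstract does not state which exact formula BiLSTM-HBBTM uses.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return the unordered word pairs (biterms) of one short document."""
    # Sort each pair so ("a", "b") and ("b", "a") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def count_biterms(docs):
    """Count biterm occurrences over all tokenized documents in one time slice."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


def bursty_probability(current, history, eps=0.01):
    """Estimate a burst probability for every biterm in the current slice.

    Assumed form (hypothetical for this paper):
        max(n_now - mean(n_past), eps) / n_now
    `current` is a Counter for the current slice; `history` is a list of
    Counters for earlier slices. `eps` is an illustrative floor value.
    """
    probs = {}
    for biterm, n_now in current.items():
        # A Counter returns 0 for biterms that never appeared in a past slice.
        n_mean = sum(h[biterm] for h in history) / len(history) if history else 0.0
        probs[biterm] = max(n_now - n_mean, eps) / n_now
    return probs


# Toy usage with made-up tokenized posts.
past_slices = [
    count_biterms([["apple", "phone"]]),
    count_biterms([["apple", "phone"]]),
]
current_slice = count_biterms([
    ["apple", "phone", "launch"],
    ["apple", "phone"],
    ["phone", "apple"],
])
probs = bursty_probability(current_slice, past_slices)
```

Under this assumed formula, a biterm that appears only in the current slice gets a probability of 1, while a biterm whose count matches its historical average falls to the floor `eps / n_now`. A filter based on such scores would keep the first kind and drop the second, which is one way word-level feature selection could reduce sparsity and dimensionality.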
Authors
向卓元
吴玉
陈浩
张芙玮
Xiang Zhuoyuan; Wu Yu; Chen Hao; Zhang Fuwei (School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073)
Source
《情报杂志》
CSSCI
PKU Core Journal (北大核心)
2022, No. 1, pp. 104-112 (9 pages)
Journal of Intelligence
Funding
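One of the research outputs of the National Natural Science Foundation of China General Program project "Research on Domain Knowledge Representation and Fusion Models for Cross-Lingual Opinion Summarization" (Grant No. 71974202).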