Abstract
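With the rapid development of the Internet, online public opinion has an ever-growing influence on society. Analyzing the massive text content produced by Internet users quickly and accurately, capturing online public opinion on that basis, and predicting its development trend is undoubtedly of great significance to social and economic development. To this end, this paper studies the problem of predicting the hotness of forum posts. Existing algorithms measure the content similarity of posts only by literal similarity, without touching the semantic level and without considering the particular preferences of post authors. To address these shortcomings, a hotness prediction algorithm combining LDA (Latent Dirichlet Allocation) and KNN (K-Nearest Neighbors) is proposed. The algorithm uses LDA to mine the topic information hidden beneath the surface text of posts and the topics that users are interested in, measures the similarity between posts at the conceptual level, and on that basis predicts post hotness with the KNN algorithm. Experimental results on two datasets show that the proposed algorithm is clearly superior in prediction accuracy to the method in related work, with the average accuracy improved by 4.34% and 2.52%, respectively.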
With the rapid development of the Internet, the impact of web public opinion on society is growing every day. Capturing web public opinion and predicting its development trends, based on an efficient and accurate analysis of the massive textual information generated by Internet users, is thus undoubtedly important for social and economic development. This paper therefore studies the problem of predicting the hotness of forum posts. Existing algorithms simply use literal similarity to approximate the content similarity between posts, without taking into account the semantic aspect of posts or the particular interests of post authors. To address these shortcomings, an algorithm combining Latent Dirichlet Allocation (LDA) and K-Nearest Neighbors (KNN) for hotness prediction is proposed. The new algorithm predicts the hotness of a post from the hotness-related information of posts and authors: it uses LDA to mine the topical information hidden in the texts of posts and applies KNN to obtain similar authors and similar posts at the topical level. The experimental results on two datasets show that the proposed method performs significantly better than another recently proposed method, and that the average precision of prediction on the two datasets is improved by 4.32% and 2.52%, respectively.
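The KNN step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each post has already been mapped to a topic distribution by an LDA model (not shown), that similarity between topic distributions is measured with cosine similarity, and that hotness is a categorical label decided by majority vote. The function names, the `"hot"`/`"cold"` labels, and the sample vectors are all hypothetical, and the author-similarity component mentioned in the abstract is left out.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_hotness(query_topics, labeled_posts, k=3):
    """Majority vote over the k labeled posts most similar to the query.

    labeled_posts is a list of (topic_vector, hotness_label) pairs.
    """
    ranked = sorted(
        labeled_posts,
        key=lambda post: cosine(query_topics, post[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 3-topic distributions, assumed to come from an LDA model.
history = [
    ([0.80, 0.10, 0.10], "hot"),
    ([0.70, 0.20, 0.10], "hot"),
    ([0.10, 0.80, 0.10], "cold"),
    ([0.20, 0.70, 0.10], "cold"),
    ([0.10, 0.10, 0.80], "cold"),
]

print(predict_hotness([0.75, 0.15, 0.10], history, k=3))
```

Because the comparison happens in topic space, two posts that share few literal words can still be neighbors when their topic mixtures are close, which is the property the abstract contrasts with literal similarity.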
Source
《四川大学学报(自然科学版)》
CAS
CSCD
PKU Core Journal List (北大核心)
2014, No. 3, pp. 467-473 (7 pages)
Journal of Sichuan University (Natural Science Edition)
Funding
Zhejiang Provincial Natural Science Foundation (LY12F02010)
Sichuan University Youth Fund (2011SCU11017)
Keywords
Web public opinion (网络舆情)
Latent Dirichlet Allocation (潜在狄利克雷分配)
K-Nearest Neighbors (K近邻)
Prediction of post hotness (帖子热度预测)
Similarity (相似性)