Abstract
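Building on link-based probabilistic latent semantic analysis (link-PLSA), this paper proposes an incremental topic-modeling method that fuses text and links. Topic modeling is first performed on the original collection of webpages; then, as the structure and content of the webpages change dynamically, a reasonable update mechanism updates the model parameters, so that dynamic changes in an online webpage stream are handled efficiently and quickly. In addition, an adaptive asymmetric learning method is proposed to fuse the latent topics of the text and link modalities. For each webpage, its topic distributions over the two modalities are fused by weighting, with the weights determined by the entropy of the webpage's feature-word distribution. Because the fused probabilistic structure reasonably relates the information of the link and text modalities, good modeling results are obtained. Experimental results on two types of datasets show that the algorithm effectively saves time and considerably improves webpage classification performance; topic displays generated by the proposed model are also provided.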
This paper proposes an incremental algorithm, based on link-PLSA, that integrates both content and links for topic modeling. First, topic modeling is performed on the initial dataset. A reasonable parameter-update technique is then presented to integrate newly arriving documents and links into the original model efficiently. Furthermore, an adaptive asymmetric learning approach is proposed to fuse the latent topics of the content and link modalities. For each webpage, the topic distributions from the two modalities are fused by weighting, with the weights determined by the entropy of the page's word distribution. Because the fused probabilistic structure properly associates the content and link modalities, better topic modeling can be achieved. Experiments on two datasets with different link structures show that the approach saves time and yields systematic improvements in classification quality. The paper also presents visualizations of the topics generated by the model.
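The entropy-weighted fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything specific here is an assumption made for illustration: the function names `entropy` and `fuse_topics`, the use of normalized Shannon entropy, and the choice that a flatter word distribution (higher entropy) lowers the weight of the content modality. The paper's actual rule may differ, including in direction.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def fuse_topics(theta_content, theta_link, word_dist):
    """Fuse one page's topic distributions from the two modalities.

    Assumption (not from the paper): the content weight is
    1 - H(word_dist) / log(V), where V is the vocabulary size, so a
    page with a flat word distribution leans on its links instead.
    """
    v = len(word_dist)
    # Normalize entropy to [0, 1]; a one-word vocabulary has entropy 0.
    h_norm = entropy(word_dist) / math.log(v) if v > 1 else 0.0
    w_content = 1.0 - h_norm
    w_link = 1.0 - w_content
    fused = [w_content * c + w_link * l
             for c, l in zip(theta_content, theta_link)]
    # Renormalize to guard against floating-point drift.
    total = sum(fused)
    return [f / total for f in fused]


if __name__ == "__main__":
    content = [0.7, 0.3]   # topic distribution from the text modality
    link = [0.2, 0.8]      # topic distribution from the link modality
    words = [0.85, 0.05, 0.05, 0.05]  # the page's word distribution
    print(fuse_topics(content, link, words))
```

The weight is computed per page, which is what makes the fusion adaptive: two pages with the same topic distributions can be fused differently if their word distributions differ in entropy.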
Source
《计算机应用研究》
CSCD
Peking University Core Journal
2012, Issue 4, pp. 1289-1293 (5 pages)
Application Research of Computers
Funding
Northwest Normal University Young Teachers' Research Capability Improvement Program (NWNU-LKQN-10-1
SKQNGG10018)