Abstract
Microblog posts are sparse and highly time-sensitive. To address these properties, this paper studies topic-oriented retrieval of authoritative information on microblog sites. First, it presents a method for extracting the latent topics ("implicit themes") of microblog posts, which alleviates the data sparsity of short microblog text. Second, it applies a two-stage clustering algorithm that groups microblog posts by topic, which reduces search time. Finally, it proposes a ranking model for topic-oriented authoritative information. The model combines pseudo-relevance feedback under the KL-divergence language model with a time factor, and it expands the query using posts with high retweet counts taken from the first page of the initial search results. Experiments on a real dataset from Sina Weibo show that the proposed implicit theme model handles the data sparsity of microblog text well, and that the authoritative information ranking model gives better search results on microblog sites than the other ranking algorithms it was compared with.
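The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formulas, so the Dirichlet smoothing, the linear age penalty (equivalent to an exponential recency prior), the interpolation weight, and all function names, field names and parameter values below are assumptions made for illustration only.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(query_model, doc_tokens, collection_model, mu=10.0):
    """Rank-equivalent negative KL divergence: sum_w p(w|q) * log p(w|d).

    p(w|d) uses Dirichlet smoothing; mu=10 is an assumed value chosen
    for very short posts, not a value taken from the paper.
    """
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for w, pq in query_model.items():
        pc = collection_model.get(w, 1e-9)  # floor for unseen words
        pd = (counts[w] + mu * pc) / (n + mu)
        score += pq * math.log(pd)
    return score


def expand_query(query_model, feedback_docs, alpha=0.5):
    """Interpolate the query model with a model of the feedback posts."""
    fb = unigram([w for d in feedback_docs for w in d["tokens"]])
    words = set(query_model) | set(fb)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * fb.get(w, 0.0)
        for w in words
    }


def rank(query_tokens, docs, now, decay=0.01, page_size=10, top_k=3):
    """Two-pass ranking with retweet-based pseudo-relevance feedback.

    Each doc is a dict with hypothetical keys: id, tokens, time, retweets.
    The time factor is modelled as a penalty of decay * age (assumption).
    """
    collection_model = unigram([w for d in docs for w in d["tokens"]])

    def score(qm, d):
        age = now - d["time"]
        return kl_score(qm, d["tokens"], collection_model) - decay * age

    # Pass 1: rank with the original query and keep the first page.
    q = unigram(query_tokens)
    first_page = sorted(docs, key=lambda d: score(q, d), reverse=True)[:page_size]

    # Feedback: the most retweeted posts on the first page.
    feedback = sorted(first_page, key=lambda d: d["retweets"], reverse=True)[:top_k]

    # Pass 2: re-rank every post with the expanded query.
    q2 = expand_query(q, feedback)
    return [d["id"] for d in sorted(docs, key=lambda d: score(q2, d), reverse=True)]
```

Calling `rank(query_tokens, docs, now)` returns post ids from best to worst. Subtracting `decay * age` from a log-likelihood score is one common way to fold recency into a language-model ranker; the paper's own time factor may be defined differently.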
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
2013, Issue 12, pp. 1135-1145 (11 pages)
Funding
National Natural Science Foundation of China, Young Scientists Fund
Beijing Natural Science Foundation
Keywords
microblog site
implicit theme
clustering
authoritative information