Abstract
Traditional topic detection methods automatically identify new topics in news media streams. They mainly target long texts and are ill suited to microblogs, where data are sparse. This paper therefore proposes a topic lexicon built from users' own language and constructs a topic-word co-occurrence graph for microblog topic detection. On this basis, the Clauset algorithm and the PageRank algorithm are each used for modularity-based clustering. The former finds distinct interest clusters from the perspective of content, and the resulting community structure is relatively flat. The latter finds distinct interest clusters from the perspective of people: the opinion leaders of the clusters are all authority figures in real-world society, and the community structure shows a fairly pronounced hierarchy.
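As a rough illustration of the two building blocks named in the abstract, the sketch below builds a weighted keyword co-occurrence graph from tokenized posts and ranks its nodes with a plain power-iteration PageRank. It is a minimal sketch under stated assumptions, not the authors' implementation: the sample posts, the function names `build_cooccurrence_graph` and `pagerank`, and the damping value 0.85 are all hypothetical, and the abstract does not say which graph PageRank was applied to (it describes that step as taken "from the perspective of people"). The Clauset modularity clustering step is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(posts):
    """Return {word: {neighbor: weight}}; weight = number of posts containing both words."""
    graph = defaultdict(lambda: defaultdict(int))
    for words in posts:
        # Count each unordered pair of distinct keywords once per post.
        for a, b in combinations(sorted(set(words)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted undirected graph (assumed damping value)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    # Total edge weight leaving each node; every node here has at least one neighbor.
    strength = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            incoming = sum(rank[u] * w / strength[u] for u, w in graph[v].items())
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical, already-tokenized posts (keywords drawn from a topic lexicon).
posts = [
    ["haze", "Beijing", "air"],
    ["haze", "air", "mask"],
    ["football", "World Cup"],
    ["football", "World Cup", "haze"],
]

graph = build_cooccurrence_graph(posts)
rank = pagerank(graph)
for word, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

Counting each keyword pair once per post keeps edge weights meaningful for short texts, where a word rarely repeats within a single post.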
Source
《情报杂志》
CSSCI
Peking University Core Journals (北大核心)
2015, Issue 11, pp. 183-187 (5 pages)
Journal of Intelligence