Abstract
[Objective] This study detects topics in microblog posts with large-scale text clustering and automatically selects the high-quality ones. [Methods] We collected food-related posts from Sina Weibo and combined text clustering with deep learning to detect topics. First, we assigned each post to one of the four seasons according to the month in which it was published. Second, we used a vector space model and text clustering to detect topics within each season and obtain candidate topics. Finally, drawing on deep learning, we proposed a topic coverage measure to evaluate topic quality automatically and discard low-quality topics. [Results] Filtering by topic coverage was consistent with manual selection, and the retained high-quality topics all had topic coverage above 0.5. [Limitations] The quality of topic detection was evaluated mainly qualitatively. [Conclusions] Selecting high-quality topics by computing topic coverage is efficient and general, yields topics that are easy to interpret, and reveals the distribution of topics in food-related microblog posts across the four seasons.
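The first two steps of the method (splitting posts by publishing month, then comparing posts in a vector space) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the month-to-season mapping, the whitespace tokenizer (a real pipeline would use a Chinese word segmenter), the sample posts, and the function names. It does not reproduce the authors' clustering algorithm or their topic coverage measure, neither of which the abstract specifies.

```python
import math
from collections import Counter, defaultdict
from datetime import date

# Assumed month-to-season mapping; the abstract does not state the exact split.
SEASON_BY_MONTH = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def split_by_season(posts):
    """Group (publish_date, text) pairs by the season of the publishing month."""
    groups = defaultdict(list)
    for published, text in posts:
        groups[SEASON_BY_MONTH[published.month]].append(text)
    return dict(groups)


def term_vector(text):
    """Term-frequency vector; whitespace splitting stands in for word segmentation."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical sample posts, not data from the study.
posts = [
    (date(2015, 7, 3), "iced watermelon juice"),
    (date(2015, 8, 12), "watermelon juice recipe"),
    (date(2015, 12, 20), "hot pot with lamb"),
]

by_season = split_by_season(posts)
summer_vectors = [term_vector(text) for text in by_season["summer"]]
similarity = cosine(summer_vectors[0], summer_vectors[1])
```

A clustering step would group posts within each season by such pairwise similarities to produce candidate topics, which a quality measure such as topic coverage could then filter.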
Source
《现代图书情报技术》
CSSCI
2016, Issue 10, pp. 70-80 (11 pages)
New Technology of Library and Information Service
Funding
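National Social Science Fund of China project "Research on User-Based Knowledge Organization Patterns in Online Social Networks" (Grant No. 14BTQ033)
National Social Science Fund of China key project "Research on a Methodological System for Public Opinion and Decision Support in the Big Data Environment" (Grant No. 14AZD084)
Jiangsu Province Graduate Research and Innovation (Practice) Program for Regular Universities project "Research on Multi-Granularity Movie Review Mining Based on Social Media" (Grant No. SJLX15_0166). This paper is one of the research outcomes of these projects.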