Abstract
[Purpose/Significance] To address the sparsity of word feature vectors in short texts, this study proposes a new perspective on hot topic detection that exploits the burst phenomenon and word co-occurrence probabilities. [Method/Process] An analytical framework is built on this basis: a word frequency mean fluctuation model identifies hot words in short texts, a probabilistic language model identifies topic words, and the similarity between the two result sets is then computed to detect and display hot topics. [Result/Conclusion] Hot topics are detected by filtering out noise words with relatively high popularity and by discovering hot event words. Compared against Google Trends results, the accuracy reaches 82.67%, which indicates that the model is effective. The study offers a reference for theoretical and practical research on hot topic detection in short texts.
Authors
徐敏
李广建
Xu Min; Li Guangjian (Department of Information Management, Peking University, Beijing 100871)
Source
《情报杂志》
CSSCI
Peking University Core Journal (北大核心)
2019, Issue 6, pp. 152-158 (7 pages)
Journal of Intelligence
Funding
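An output of the National Social Science Fund of China Key Project "Research on Computational Intelligence Analysis Methods and Technologies in the Big Data Environment" (No. 14ATQ005)
An interim output of the National Social Science Fund of China Major Project "System Architecture, Implementation Models, and Empirical Research on Knowledge Fusion in the Big Data Era" (No. 15ZDB129)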
Keywords
short texts
hot topic detection
word frequency mean fluctuation model
probabilistic language model