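Abstract
This work puts into practice a method for quickly extracting hot words from a large volume of scattered microblog text and forming topics from them. First, three statistical features of a text string (word frequency, internal cohesion, and adhesion) are used to judge whether the string forms a word, so that high-frequency words are extracted directly from the sample corpus. Real-time hot words are then selected according to how often these high-frequency words occur in different time windows. Finally, word co-occurrence is used to determine the degree of association between hot words, and the hot words are clustered into hot topics. Experiments show that the algorithm is simple to implement and performs well in topic detection.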
This paper aims to extract valuable information from massive fragmented content and feed it back to the user in a concise form. First, considering three statistical characteristics of a text string (its word frequency, internal degree of coupling, and external degree of flexibility), we extract high-frequency words from a micro-blog corpus. We then select real-time hot words according to the frequency of occurrence of these high-frequency words in different time windows, and finally use word co-occurrence to determine the correlation between hot words and group them into hot topics. Experimental results show that the algorithm is simple and practical, and that it achieves good results in topic detection.
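The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas or thresholds, so everything concrete here is an assumption made for illustration: cohesion is taken as the weakest ratio p(s) / (p(a) p(b)) over binary splits, flexibility as the smaller of the left and right neighbor entropies, "hot" as a burst in document frequency between two adjacent windows, and association as Jaccard overlap of the posts containing each word. The function names and default parameters are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor characters."""
    n = sum(counter.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log(c / n) for c in counter.values())


def extract_words(posts, max_len=4, min_freq=3, min_cohesion=2.0, min_entropy=0.5):
    """Stage 1 (assumed scoring): keep strings that are frequent, internally
    cohesive, and seen with varied left and right neighbors."""
    freq = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    total = 0
    for post in posts:
        total += len(post)
        for i in range(len(post)):
            for n in range(1, max_len + 1):
                if i + n > len(post):
                    break
                s = post[i:i + n]
                freq[s] += 1
                if i > 0:
                    left[s][post[i - 1]] += 1
                if i + n < len(post):
                    right[s][post[i + n]] += 1
    words = []
    for s, f in freq.items():
        if len(s) < 2 or f < min_freq:
            continue
        # Cohesion: weakest p(s) / (p(a) * p(b)) over all binary splits a+b = s.
        cohesion = min(f * total / (freq[s[:k]] * freq[s[k:]])
                       for k in range(1, len(s)))
        flexibility = min(entropy(left[s]), entropy(right[s]))
        if cohesion >= min_cohesion and flexibility >= min_entropy:
            words.append(s)
    return words


def hot_words(words, prev_posts, curr_posts, min_count=2, min_ratio=2.0):
    """Stage 2 (assumed rule): a word is hot when its document frequency in the
    current window is large relative to the previous window."""
    hot = []
    for w in words:
        prev = sum(w in p for p in prev_posts)
        curr = sum(w in p for p in curr_posts)
        if curr >= min_count and curr / (prev + 1) >= min_ratio:
            hot.append(w)
    return hot


def cluster_topics(hot, posts, min_assoc=0.5):
    """Stage 3 (assumed measure): link hot words whose post sets have Jaccard
    overlap >= min_assoc; connected components are reported as topics."""
    docs = {w: {i for i, p in enumerate(posts) if w in p} for w in hot}
    parent = {w: w for w in hot}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(hot, 2):
        union = docs[a] | docs[b]
        if union and len(docs[a] & docs[b]) / len(union) >= min_assoc:
            parent[find(a)] = find(b)

    topics = defaultdict(set)
    for w in hot:
        topics[find(w)].add(w)
    return sorted(sorted(t) for t in topics.values())
```

In this sketch the output of `extract_words` would be passed to `hot_words` together with the posts of two adjacent time windows, and the resulting hot words would be passed to `cluster_topics` with the current window's posts. Connected components are only one simple way to turn pairwise association into clusters; the paper may use a different clustering step.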
Source
《情报杂志》
CSSCI
PKU Core (Peking University Core Journals)
2015, No. 6, pp. 109-113, 157 (6 pages in total)
Journal of Intelligence
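Funding
National Natural Science Foundation of China project "Research on Task Management Methods for Crowdsourced Semantic Annotation of Massive Data" (No. 71401096)
Ministry of Education Humanities and Social Sciences Fund project "Research on an Ontology-Based, User-Interest-Oriented System for Analyzing and Assessing Online Public Opinion: The Case of Forums" (No. 10YJC860010)
Shandong Province Higher Education Humanities and Social Sciences Research Program project "Research on Key Influencing Factors and Countermeasures for the Sustainable Development of Cloud Computing" (No. J13WG16)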
Keywords
microblog
microblog hot words
topic detection
word co-occurrence
micro-blogging; micro-blogging hot word; topic detection; word co-occurrence