Abstract
Event extraction from personal microblogs mostly relies on text similarity alone to produce clusters, without fully considering microblog-specific features. Targeting features such as hashtags, URLs, and posting time, this paper proposes an event extraction algorithm based on microblog features. The algorithm adapts TF-IDF to these features, adds hashtag similarity and URL similarity, and computes a comprehensive similarity from the three. It then applies an improved K-means clustering method that first segments posts by time and then merges the segments to obtain the extracted events. Experimental results show that the feature-based algorithm clearly improves the accuracy of both microblog keyword extraction and event extraction.
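The abstract does not state how the three similarities are combined. As a minimal sketch, assuming a weighted sum of cosine similarity over TF-IDF vectors, Jaccard similarity over hashtags, and Jaccard similarity over URLs, the comprehensive similarity might look like the following. The function names, the weights `w_text`, `w_tag`, `w_url`, and the post layout are hypothetical, and plain smoothed TF-IDF stands in for the paper's modified TF-IDF, whose details are not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into sparse {term: weight} TF-IDF vectors."""
    n = len(docs)
    # Document frequency: number of posts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty posts
        vectors.append({
            # Smoothed IDF keeps every weight strictly positive.
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(p, q, w_text=0.6, w_tag=0.25, w_url=0.15):
    """Weighted sum of text, hashtag, and URL similarity (assumed weights)."""
    return (w_text * cosine(p["vec"], q["vec"])
            + w_tag * jaccard(p["tags"], q["tags"])
            + w_url * jaccard(p["urls"], q["urls"]))
```

Each post here is a dict with a TF-IDF vector under `"vec"` and sets of hashtags and URLs under `"tags"` and `"urls"`. Treating two posts with no hashtags or URLs as having similarity 0.0 on that feature is a design choice in this sketch: the absence of a feature is taken as no evidence of a shared event.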
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2016, Issue 7, pp. 47-51 (5 pages)
Funding
National Natural Science Foundation of China (61163025)
Keywords
Microblog characteristics
Event extraction
Comprehensive similarity