Abstract
Existing clustering methods for individual microblogs pay little attention to the semantic features of the text. To address this, a clustering method that integrates the semantic features of microblog texts is proposed, so that semantically related microblogs are clustered more accurately. The method has three main steps. First, a random walk algorithm generates the semantic signature of each lexical item together with its probability; the walk graph is built from the HowNet semantic relation graph. Second, an alignment algorithm computes the similarity between the senses of the lexical items in two microblogs to obtain sets of senses. Finally, the semantic relatedness of two microblogs is computed with cosine similarity, and microblogs whose relatedness exceeds a similarity threshold are clustered together. To improve efficiency, the similarity computation is segmented by time and optimized. Experimental results show that clustering with semantic features achieves a markedly higher F-measure than clustering methods based on word co-occurrence and word2vec.
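The three steps outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph stands in for the HowNet semantic relation graph, the alignment step is replaced by simply summing word signatures, and the names `semantic_signature`, `post_vector`, `cluster`, the damping factor 0.85, and the threshold 0.8 are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def semantic_signature(graph, seed, damping=0.85, iters=50):
    """Random walk with restart from `seed`; returns node -> probability.
    Assumes every neighbour listed in `graph` is also a key of `graph`."""
    prob = {n: 0.0 for n in graph}
    prob[seed] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        for n, neighbours in graph.items():
            if not neighbours:
                # Dangling node: send its walking mass back to the seed.
                nxt[seed] += damping * prob[n]
                continue
            share = damping * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        nxt[seed] += 1.0 - damping  # restart step
        prob = nxt
    return prob

def post_vector(words, graph):
    """Sum the signatures of a post's words (simplification: no sense alignment)."""
    total = defaultdict(float)
    for w in words:
        if w in graph:  # words missing from the graph are ignored
            for node, p in semantic_signature(graph, w).items():
                total[node] += p
    return dict(total)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(posts, graph, threshold=0.8):
    """Single-pass clustering: join the first cluster whose first member
    is more similar than `threshold`, otherwise start a new cluster."""
    vectors = [post_vector(p, graph) for p in posts]
    clusters = []  # each cluster is a list of post indices
    for i, vec in enumerate(vectors):
        for c in clusters:
            if cosine(vec, vectors[c[0]]) > threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical toy graph with two unconnected topics (not HowNet data).
toy_graph = {
    "football": ["match", "goal"],
    "match": ["football", "goal"],
    "goal": ["football", "match"],
    "noodles": ["soup", "dinner"],
    "soup": ["noodles", "dinner"],
    "dinner": ["noodles", "soup"],
}
posts = [["football", "goal"], ["match", "goal"], ["noodles", "soup"], ["dinner"]]
print(cluster(posts, toy_graph))
```

On this toy input the two sports posts and the two food posts fall into separate clusters, because signatures from one topic carry no probability mass on the other topic's nodes. The single-pass grouping rule is only one way to apply a similarity threshold; the abstract does not state which procedure the paper uses.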
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal (北大核心)
2017, Issue 7, pp. 1543-1548 (6 pages)
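Funding
Supported by the National Natural Science Foundation of China (61163025)
Supported by the Natural Science Foundation of Inner Mongolia Autonomous Region (2015MS0621)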
Keywords
individual microblog
clustering
semantic
HowNet