Abstract
Microblogs are fast, easy to use, and highly interactive, and have become an important channel for modern information exchange. Because oversight mechanisms are still imperfect, they can also become a breeding ground for false information, so the monitoring, analysis, and early warning of microblog public opinion need to be strengthened. To address the sparse, high-dimensional nature of microblog content, this paper proposes a fuzzy c-means clustering algorithm based on term correlation. The algorithm mines the degree of semantic correlation between terms to enrich the text features as much as possible, and uses a small set of pre-labelled texts of the same class to guide the choice of initial cluster centers for fuzzy c-means, thereby improving the clustering. Experimental results show that the algorithm overcomes the data sparsity of microblogs to a certain extent and clusters microblogs efficiently.
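The two ideas in the abstract (enriching sparse term vectors through term correlation, and seeding fuzzy c-means from a few pre-labelled texts) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption for illustration: the co-occurrence-based correlation measure, the weight `alpha`, the toy documents, and the helper names `term_correlation`, `enrich`, `seeded_centers`, and `fuzzy_c_means` are all hypothetical, and only the membership and center updates follow the standard fuzzy c-means definitions.

```python
import math

def term_correlation(docs, n_terms):
    """Assumed measure: co-occurrence count normalised by document frequencies."""
    df = [0] * n_terms
    co = [[0] * n_terms for _ in range(n_terms)]
    for doc in docs:
        present = [t for t in range(n_terms) if doc[t] > 0]
        for i in present:
            df[i] += 1
            for j in present:
                if i != j:
                    co[i][j] += 1
    return [[co[i][j] / math.sqrt(df[i] * df[j]) if co[i][j] else 0.0
             for j in range(n_terms)] for i in range(n_terms)]

def enrich(doc, corr, alpha=0.5):
    """Add weight to terms correlated with the terms a document contains."""
    n = len(doc)
    return [doc[j] + alpha * sum(doc[i] * corr[i][j] for i in range(n))
            for j in range(n)]

def seeded_centers(vectors, seeds):
    """Initial centers = mean vector of the pre-labelled texts of each class."""
    centers = []
    for k in sorted(seeds):
        members = [vectors[i] for i in seeds[k]]
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers

def memberships(vectors, centers, m=2.0, eps=1e-9):
    """Standard FCM membership: u_ik proportional to d_ik^(-2/(m-1))."""
    u = []
    for x in vectors:
        inv = [(math.dist(x, c) + eps) ** (-2.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u

def fuzzy_c_means(vectors, centers, m=2.0, iters=50):
    """Alternate membership and center updates, starting from given centers."""
    n, dim = len(vectors), len(vectors[0])
    for _ in range(iters):
        u = memberships(vectors, centers, m)
        centers = [[sum(u[i][k] ** m * vectors[i][j] for i in range(n)) /
                    sum(u[i][k] ** m for i in range(n))
                    for j in range(dim)]
                   for k in range(len(centers))]
    return memberships(vectors, centers, m), centers

# Toy term-presence vectors over 4 hypothetical terms: rain, storm, goal, match.
docs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
        [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
corr = term_correlation(docs, 4)
vectors = [enrich(d, corr) for d in docs]
# Documents 0 and 3 play the role of the pre-labelled texts for two classes.
centers0 = seeded_centers(vectors, {0: [0], 1: [3]})
u, centers = fuzzy_c_means(vectors, centers0)
labels = [row.index(max(row)) for row in u]
print(labels)
```

In this toy data the second and third documents share no term, so their raw vectors are orthogonal; after enrichment each carries some weight on the other's term, which is the kind of sparsity relief the abstract describes.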
Source
《安庆师范学院学报(自然科学版)》
2016, No. 3, pp. 68-72 (5 pages)
Journal of Anqing Teachers College (Natural Science Edition)
Funding
Natural Science Foundation of Higher Education Institutions of Anhui Province (KJ2013A177)
Keywords
microblog
term correlation
fuzzy clustering