Abstract
Micro-blog platforms carry a large amount of advertising content, which seriously degrades the experience of ordinary users and interferes with related research. Existing studies mostly handle advertising micro-blogs with classification algorithms such as Support Vector Machine (SVM) or random forest. However, manually labeling a large training set for such classifiers is difficult. A micro-blog advertisement publisher identification method based on clustering analysis was therefore proposed. At the user level, the concept of the core micro-blog was introduced to address the phenomenon that advertisement publishers post large numbers of ordinary micro-blogs to dilute their advertising content. The core micro-blog topics of each user and the corresponding micro-blog sequences were extracted and used to compute user features and the text features of those micro-blogs. A clustering algorithm was then applied to these features to identify micro-blog advertisement publishers. The experimental results show a precision of 93%, a recall of 97%, and an F value of 95%, which indicates that the proposed method can accurately identify micro-blog advertisement publishers even when the advertising content has been artificially diluted. The method provides theoretical support and a practical approach for identifying and cleaning up micro-blog spam.
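The clustering step described in the abstract can be illustrated with a minimal sketch. The keywords name DBSCAN as the clustering algorithm, so the sketch implements a small DBSCAN in pure Python. Everything else is an assumption made for illustration: the two per-user features (`url_ratio`, `core_similarity`), the user names, the sample values, and the `eps` and `min_pts` settings are hypothetical and are not taken from the paper.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical per-user feature vectors: (url_ratio, core_similarity).
# These features and values are illustrative assumptions only.
users = {
    "ad_a": (0.90, 0.85),
    "ad_b": (0.88, 0.80),
    "ad_c": (0.92, 0.83),
    "user_a": (0.10, 0.20),
    "user_b": (0.12, 0.25),
    "user_c": (0.08, 0.22),
    "odd": (0.50, 0.95),
}

labels = dbscan(list(users.values()), eps=0.1, min_pts=3)
print(dict(zip(users, labels)))
```

Because clustering needs no labeled training set, it avoids the manual annotation cost that the abstract attributes to classification methods. Deciding which resulting cluster corresponds to advertisement publishers is a separate step that this sketch does not cover.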
Authors
赵星宇
赵志宏
王业沛
陈松宇
ZHAO Xingyu; ZHAO Zhihong; WANG Yepei; CHEN Songyu (Software Institute, Nanjing University, Nanjing, Jiangsu 210093, China)
Source
《计算机应用》
CSCD
PKU Core Journal (北大核心)
2018, Issue 5, pp. 1267-1271 (5 pages)
Journal of Computer Applications
Funding
Jiangsu Province Prospective Joint Research Project on Industry-University-Research Cooperation (BY2015069-03)
Keywords
micro-blog advertising
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
text filtering
feature extraction