Abstract
Social networks contain large amounts of spam, such as marketing and recruitment posts, as well as short posts with no substantive content. This noise interferes with topic modeling, detracts from user interest, and seriously affects academic research and commercial applications built on social networks. This paper therefore proposes a semi-supervised topic noise filtering method for Chinese microblog text that combines support vector machine and k-nearest neighbor models (pSVM-kNN). Starting from the hyperplane computed by the SVM, the kNN algorithm iteratively searches a local region for the optimal classification hyperplane. To reduce misclassification and improve noise filtering, penalty costs are introduced in the SVM stage and proportional weights in the kNN stage. Experiments on Sina Weibo datasets of different sizes show that the method, trained on only a small number of labeled samples, outperforms the compared methods in precision, recall, and F-measure.
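The abstract does not give the details of pSVM-kNN, so the following is only a minimal sketch of the general idea of a hybrid SVM-kNN classifier, not the authors' method. It assumes that the "penalty cost" is a per-class weight on the hinge loss and that the "proportional weight" is a per-class scaling of kNN votes. The function names, the `band` threshold, and the toy data are all hypothetical, and the iterative semi-supervised refinement is omitted.

```python
import math
import random

def train_linear_svm(X, y, cost, epochs=200, lr=0.01, lam=0.01, seed=0):
    """Subgradient descent on a class-weighted hinge loss.
    cost maps each label (+1/-1) to its misclassification penalty (assumed form)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 regularisation shrinks w slightly on every step
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Hinge loss is active: move towards the sample, scaled by its class cost
                step = lr * cost[y[i]] * y[i]
                w = [wj + step * xj for wj, xj in zip(w, X[i])]
                b += step
    return w, b

def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def weighted_knn(x, X, y, k, class_weight):
    """Vote among the k nearest labelled points; each vote is scaled by a per-class weight."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
    score = sum(class_weight[y[i]] * y[i] for i in nearest)
    return 1 if score >= 0 else -1

def hybrid_predict(x, w, b, X, y, band=1.0, k=3, class_weight=None):
    """Trust the SVM far from the hyperplane; fall back to weighted kNN near it."""
    class_weight = class_weight or {1: 1.0, -1: 1.0}
    d = decision(w, b, x)
    if abs(d) >= band:
        return 1 if d > 0 else -1
    return weighted_knn(x, X, y, k, class_weight)

# Toy 2-D features: +1 = noise post, -1 = normal post (made-up data)
X = [(2, 2), (2.5, 1.5), (1.5, 2.5), (3, 2),
     (-2, -2), (-2.5, -1.5), (-1.5, -2.5), (-3, -2)]
y = [1, 1, 1, 1, -1, -1, -1, -1]

# Penalise missed noise twice as heavily as a false alarm (arbitrary choice)
w, b = train_linear_svm(X, y, cost={1: 2.0, -1: 1.0})
labels = [hybrid_predict(p, w, b, X, y) for p in [(3, 3), (-3, -3), (0.5, 0.2)]]
```

In this sketch only points whose SVM score falls inside the band are passed to kNN, which keeps the more expensive neighbor search limited to the ambiguous region near the hyperplane.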
Authors
屠守中
杨婧
赵林
朱小燕
TU Shouzhong; YANG Jing; ZHAO Lin; ZHU Xiaoyan (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China)
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journals
2019, Issue 3, pp. 178-185 (8 pages)
Journal of Tsinghua University (Science and Technology)
Funding
National Natural Science Foundation of China (61332007, 61303049)
Keywords
social networks
support vector machine
k-nearest neighbor
noise filtering
penalty cost