Abstract
This study aims to classify the gender of micro-blog users and, to improve classification accuracy, compares several models and model-fusion methods. The dataset comes from the technical evaluation of the first "微众杯" (Weizhong Cup). First, taking into account the characteristics of Chinese micro-blog text, the data are preprocessed at the granularity of individual micro-blog users. Logistic Regression, Random Forest, SVM, and other models are then used for classification, with model parameters, model types, and kernel functions varied for comparison. Finally, the training samples are divided into several batches, and fusion classification is performed both across different models and across multiple instances of the same model. The experimental results show that fusing multiple SVM models achieves relatively high accuracy in classifying the gender of micro-blog users.
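The batch-and-fuse step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the per-batch models are combined, so majority voting is an assumption here. The learner `train_centroid` is a hypothetical toy stand-in (a nearest-centroid rule, not an SVM), and the names `split_batches` and `fuse_predict` are likewise illustrative rather than taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]      # (feature vector, label)
Model = Callable[[Sequence[float]], str]  # maps a feature vector to a label


def split_batches(samples: List[Sample], k: int) -> List[List[Sample]]:
    """Deal the training samples round-robin into k batches."""
    return [samples[i::k] for i in range(k)]


def train_centroid(batch: List[Sample]) -> Model:
    """Hypothetical stand-in learner (NOT an SVM): nearest class centroid."""
    sums, counts = {}, Counter()
    for x, y in batch:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x: Sequence[float]) -> str:
        # Pick the label whose centroid has the smallest squared distance to x.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def fuse_predict(models: List[Model], x: Sequence[float]) -> str:
    """Assumed fusion rule: majority vote over the per-batch models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]


# Toy two-feature data; "f" and "m" are placeholder gender labels.
samples = [
    ([0.0, 0.1], "f"), ([1.0, 0.9], "m"), ([0.1, 0.0], "f"),
    ([0.9, 1.0], "m"), ([0.2, 0.1], "f"), ([1.1, 1.0], "m"),
]
models = [train_centroid(b) for b in split_batches(samples, 3)]
label = fuse_predict(models, [0.1, 0.1])
```

Using an odd number of batches avoids tied votes in the two-class case. Fusing different model types rather than copies of one model would only require passing a different training function for each batch.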
Source
Journal of Liaoning University of Technology (Natural Science Edition) (《辽宁工业大学学报(自然科学版)》)
2018, No. 1, pp. 13-18 (6 pages)
Keywords
micro-blog user
gender
classification
model fusion