Abstract
In today's information-rich society, SMS has become part of everyday life and plays an irreplaceable role in social communication and information exchange, yet spam messages are a persistent nuisance, which makes spam filtering a necessary task. Unlike ordinary classification problems, spam messages take the form of text data, and their textual features differ from one message to another, which makes classification harder. In addition, spam accounts for only a small share of all messages, so the data are usually class-imbalanced, and the limited information carried by the minority class further increases the difficulty of identification. To address these problems, this paper first applies the TF-IDF method for feature extraction, converting the text data into vectors. Random under-sampling is then applied to the transformed data to obtain several class-balanced training samples, and on each sample classification models such as naive Bayes, decision tree, and support vector machine are trained to produce the corresponding base classifiers. Finally, the base classifiers are fused using ensemble learning, yielding a spam SMS recognition model with relatively high classification performance.
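The pipeline described in the abstract (TF-IDF features, random under-sampling into balanced subsets, one base classifier per subset, ensemble fusion) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy messages, the whitespace tokenization, the function names, and the smoothed IDF formula are all invented for the example, a nearest-centroid learner stands in for the naive Bayes, decision tree, and SVM base classifiers, and majority voting is assumed as the fusion rule, which the abstract does not specify.

```python
import math
import random
from collections import Counter

def tfidf_fit(docs):
    """Learn smoothed IDF weights from tokenized documents (assumed formula)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(doc, idf):
    """Map one tokenized document to an L2-normalized sparse TF-IDF vector."""
    tf = Counter(t for t in doc if t in idf)  # out-of-vocabulary terms are dropped
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def undersample(X, y, rng):
    """Randomly keep as many samples per class as the minority class has."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    k = min(len(items) for items in by_class.values())
    X_bal, y_bal = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, k):
            X_bal.append(xi)
            y_bal.append(label)
    return X_bal, y_bal

class CentroidClassifier:
    """Stand-in base learner: pick the class whose mean vector is most similar."""
    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):  # sorted, so ties go to the lower label
            members = [x for x, yi in zip(X, y) if yi == label]
            total = Counter()
            for x in members:
                for t, v in x.items():
                    total[t] += v
            self.centroids[label] = {t: v / len(members) for t, v in total.items()}
        return self

    def predict_one(self, x):
        def dot(a, b):
            return sum(v * b.get(t, 0.0) for t, v in a.items())
        return max(self.centroids, key=lambda label: dot(x, self.centroids[label]))

def train_ensemble(X, y, n_models=5, seed=0):
    """Train one base classifier per independently under-sampled balanced subset."""
    rng = random.Random(seed)
    return [CentroidClassifier().fit(*undersample(X, y, rng)) for _ in range(n_models)]

def predict(models, x):
    """Fuse the base classifiers by majority vote (assumed fusion rule)."""
    votes = Counter(m.predict_one(x) for m in models)
    return votes.most_common(1)[0][0]

# Invented toy corpus: label 0 = normal message, label 1 = spam (minority class).
ham = ["see you at lunch", "call me when you arrive", "meeting moved to friday",
       "happy birthday to you", "are you home yet", "thanks for the help",
       "dinner at seven tonight", "the train is late again"]
spam = ["win a free prize now", "free cash prize click now"]
docs = [m.split() for m in ham + spam]
labels = [0] * len(ham) + [1] * len(spam)

idf = tfidf_fit(docs)
X = [tfidf_vector(d, idf) for d in docs]
models = train_ensemble(X, labels)

def classify(message):
    """Classify a raw message string with the trained ensemble."""
    return predict(models, tfidf_vector(message.split(), idf))
```

Calling `classify("claim your free prize now")` runs a new message through the same TF-IDF mapping and the vote. Drawing several balanced subsets, rather than one, lets each base classifier see a different part of the majority class, so less of the normal-message data is discarded overall.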
Authors
熊健
邹东兴
XIONG Jian; ZOU Dong-xing (School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China)
Source
《广州大学学报(自然科学版)》
CAS
2018, No. 5, pp. 1-7 (7 pages)
Journal of Guangzhou University:Natural Science Edition