Abstract
This paper studies and implements a client-side SMS classification system for mobile phones based on the kNN algorithm. Two feature vector sets, one for normal messages and one for spam, are extracted from a self-built SMS corpus. Preprocessing, dimensionality reduction, and removal of low-frequency feature terms ensure that each set carries as many of the characteristic terms of its class as possible. The corpus is divided into a reference set, against which incoming messages are compared, and a test set. Classification performs best when the reference set contains n = 600 messages: a smaller n lowers the recognition rate, while a larger n increases classification time. The best number of nearest neighbors is k = 25. The study also finds that results are best when the probability difference used in selecting the k messages lies between 1% and 2% and the count difference used in deciding the message class lies between 5 and 15. Following the principle of preserving the pass rate of normal messages while raising the spam recognition rate, the final parameters are n = 600, k = 25, a probability difference of 1.5%, and a count difference of 9, which give recognition rates of up to 97.3% for normal messages and 89% for spam.
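The decision rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, raw term-frequency vectors, and cosine similarity in a vector space model, and it reads the "count difference" as the margin by which spam neighbors must outnumber normal neighbors before a message is labeled spam. The names `vectorize`, `cosine`, `classify`, and `count_margin` are hypothetical. The probability-difference threshold is left out because the abstract does not define it.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector over whitespace tokens (an assumed tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(message, reference, k=25, count_margin=9):
    """Label a message 'spam' or 'normal' using its k nearest reference messages.

    reference: list of (text, label) pairs, label being 'spam' or 'normal'.
    count_margin: assumed reading of the abstract's "count difference".
    The message is labeled spam only if spam neighbors outnumber normal
    neighbors by at least this margin; otherwise it defaults to normal,
    which favors the pass rate of normal messages.
    """
    query = vectorize(message)
    ranked = sorted(
        reference,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    neighbors = ranked[:k]
    spam = sum(1 for _, label in neighbors if label == "spam")
    normal = len(neighbors) - spam
    return "spam" if spam - normal >= count_margin else "normal"
```

With the defaults k = 25 and a margin of 9, at least 17 of the 25 neighbors must be spam before a message is rejected. Ties and near-ties fall to the normal class, in line with the abstract's stated principle of protecting normal messages first.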
Source
《山东农业大学学报(自然科学版)》
CSCD
Peking University Core Journal
2014, No. 2, pp. 216-222 (7 pages)
Journal of Shandong Agricultural University:Natural Science Edition
Funding
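Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B181)
Anhui Provincial Natural Science Research Project for Higher Education Institutions (KJ2012B183)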
Keywords
SMS classification
k-nearest neighbor (kNN) algorithm
feature vector set
vector space model