Abstract
Learning algorithms for email classification need large numbers of labeled samples, and labeling them by hand is time-consuming and laborious. In addition, because of the particular way email content is expressed and spelled, its feature space is generally sparse and the feature distribution is sometimes skewed; both factors degrade classification performance. To reduce the time and effort spent labeling the training set and to handle sparse email data better, a spherical k-means clustering algorithm with self-adaptive selection of the best density radius (SSk-means) is introduced as a front-end to a support vector machine (SVM): the training set is expanded by clustering before it is passed to the SVM classifier. Experimental results and performance comparisons show that, when the training set contains only a very small number of labeled emails together with a certain amount of unlabeled emails, the method is more accurate and performs considerably better than a standard SVM.
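The idea of expanding a small labeled set by clustering before training can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the paper's SSk-means algorithm: the abstract does not specify how the density radius is selected, so a fixed cosine-similarity threshold (the hypothetical parameter `min_similarity`) stands in for it, and the function name `expand_training_set` is likewise made up for illustration. Emails are assumed to be already converted to numeric feature vectors, and every class is assumed to have at least one labeled example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(a * b for a, b in zip(u, v))


def expand_training_set(labeled, unlabeled, min_similarity=0.5, iterations=10):
    """Pseudo-label unlabeled vectors with a seeded spherical k-means.

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    min_similarity: ASSUMPTION, a fixed threshold replacing the paper's
                    self-adaptive density radius, which the abstract
                    does not describe.
    Returns the labeled pairs plus the pseudo-labeled pairs.
    """
    labeled = [(normalize(v), y) for v, y in labeled]
    unlabeled = [normalize(v) for v in unlabeled]
    classes = sorted({y for _, y in labeled})
    assignment = [None] * len(unlabeled)

    for _ in range(iterations):
        # One centroid per class: the normalized mean of its labeled
        # vectors and the unlabeled vectors currently assigned to it.
        centroids = {}
        for c in classes:
            members = [v for v, y in labeled if y == c]
            members += [v for v, a in zip(unlabeled, assignment) if a == c]
            mean = [sum(col) / len(members) for col in zip(*members)]
            centroids[c] = normalize(mean)

        # Assign each unlabeled vector to its most similar centroid,
        # or leave it unassigned if it is not similar enough to any.
        new_assignment = []
        for v in unlabeled:
            best = max(classes, key=lambda c: cosine(v, centroids[c]))
            close_enough = cosine(v, centroids[best]) >= min_similarity
            new_assignment.append(best if close_enough else None)

        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment

    pseudo = [(v, a) for v, a in zip(unlabeled, assignment) if a is not None]
    return labeled + pseudo
```

The returned list of (vector, label) pairs could then be used to train any SVM implementation. Vectors that stay below the threshold are left out rather than forced into a class, which is one plausible way to avoid adding noisy pseudo-labels.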
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2009, No. 2, pp. 385-387, 391 (4 pages)
Computer Engineering and Design