
An Email SVM Classification Learning Algorithm Guided by SSk-means Clustering
Abstract: Learning algorithms for email classification need a large number of labeled samples, and labeling them by hand is slow and laborious. In addition, because of the particular way email content is expressed and spelled, its feature space is generally sparse and the feature distribution is sometimes skewed; both properties degrade classification. To reduce the time and effort spent labeling the training set, and to handle sparse email data better, a spherical k-means clustering algorithm with self-adaptive selection of the optimal density radius (SSk-means) is introduced as a front end to a support vector machine (SVM): the training set is expanded by clustering and then passed to the SVM classifier. Experimental results and performance comparisons show that, when the training set contains only a very small number of labeled emails together with a certain amount of unlabeled emails, the method performs considerably better than a standard SVM.
Source: Computer Engineering and Design (《计算机工程与设计》), CSCD, Peking University Core Journals list, 2009, No. 2, pp. 385-387, 391 (4 pages)
Keywords: email classification algorithm; spherical k-means (Sk-means) algorithm; labeled data; self-adaptive selection of the optimal density radius; support vector machine (SVM)
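
The abstract describes clustering as a front end that expands a small labeled training set before SVM training. The following is a minimal sketch of that idea, not the authors' implementation. It assumes plain spherical k-means (cosine similarity on unit-length vectors) seeded with one centroid per labeled class, and it replaces the paper's self-adaptive density radius, which the abstract does not specify, with a fixed similarity threshold `min_sim`. The function names `spherical_kmeans` and `expand_training_set` and the toy term-count vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(points, init_centroids, iters=20):
    """Cluster unit-normalized points by cosine similarity to unit-length centroids."""
    points = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    assign = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its most similar centroid.
        new_assign = [
            max(range(len(centroids)), key=lambda k: cosine(p, centroids[k]))
            for p in points
        ]
        if new_assign == assign:
            break  # converged: assignments did not change
        assign = new_assign
        # Update step: a centroid is the normalized sum of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return assign, centroids

def expand_training_set(labeled, unlabeled, min_sim=0.9):
    """Pseudo-label unlabeled vectors that lie close to a class-seeded cluster.

    labeled:   list of (vector, label) pairs
    unlabeled: list of vectors
    min_sim:   ASSUMPTION: a fixed cosine threshold standing in for the
               paper's self-adaptively selected density radius.
    Returns (expanded_training_set, pseudo_labels), where pseudo_labels[i]
    is None for unlabeled vectors that were left out.
    """
    classes = sorted({y for _, y in labeled})
    # Seed one centroid per class with the sum of that class's labeled vectors.
    seeds = []
    for c in classes:
        vecs = [normalize(v) for v, y in labeled if y == c]
        seeds.append([sum(col) for col in zip(*vecs)])
    all_points = [v for v, _ in labeled] + list(unlabeled)
    assign, centroids = spherical_kmeans(all_points, seeds)

    expanded = list(labeled)
    pseudo = []
    for v, k in zip(unlabeled, assign[len(labeled):]):
        if cosine(normalize(v), centroids[k]) >= min_sim:
            pseudo.append(classes[k])  # cluster k keeps the class that seeded it
            expanded.append((v, classes[k]))
        else:
            pseudo.append(None)  # too far from the centroid: leave unlabeled
    return expanded, pseudo

# Toy example: 3-term count vectors, one labeled email per class.
labeled = [([5, 1, 0], "spam"), ([0, 1, 5], "ham")]
unlabeled = [[4, 0, 1], [1, 0, 6], [3, 2, 0], [2, 4, 3]]
expanded, pseudo = expand_training_set(labeled, unlabeled)
print(pseudo)
print(len(expanded), "training examples after expansion")
```

The expanded set would then be handed to any SVM trainer in place of the original labeled set. Leaving out points below the threshold is a conservative choice: a wrong pseudo-label can hurt the classifier more than a missing example.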

