An Email SVM Classification Learning Algorithm Guided by SSk-means Clustering

Email SVM classification based on SSk-means clustering algorithm

Abstract: Learning algorithms for email classification need large numbers of labeled samples, and labeling by hand is slow and laborious. In addition, because of the particular way emails are written and spelled, the email feature space is generally sparse and the feature distribution is sometimes skewed; both factors degrade classification performance. To reduce the time and effort spent labeling the training set and to handle sparse email data better, a spherical k-means clustering algorithm with self-adaptive selection of the optimal density radius (SSk-means) is introduced as a front end to a support vector machine (SVM): the training set is expanded by clustering and then passed to the SVM classifier. Experimental results and performance comparisons show that when the training set contains only a very small number of labeled emails together with a certain amount of unlabeled emails, the proposed method performs considerably better than a standard SVM.
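The training-set expansion step described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract gives no algorithmic details, so the adaptive density-radius selection is omitted and each cluster is simply seeded from the labeled emails of one class. The function names (`normalize`, `mean_direction`, `expand_training_set`) and the toy feature vectors are hypothetical. Only the spherical k-means idea (unit-length vectors, cosine similarity, normalized centroids) is used.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_direction(vectors):
    """Spherical k-means centroid: the normalized sum of the member vectors."""
    dim = len(vectors[0])
    total = [sum(v[i] for v in vectors) for i in range(dim)]
    return normalize(total)


def expand_training_set(labeled, unlabeled, iterations=10):
    """Pseudo-label unlabeled vectors by seeded spherical k-means.

    labeled:   list of (feature_vector, label) pairs
    unlabeled: list of feature vectors
    Returns the labeled pairs plus (vector, pseudo_label) pairs.
    Assumption: one cluster per class, seeded from that class's labeled data.
    """
    labeled = [(normalize(v), y) for v, y in labeled]
    unlabeled = [normalize(v) for v in unlabeled]
    classes = sorted({y for _, y in labeled})
    centroids = {
        c: mean_direction([v for v, y in labeled if y == c]) for c in classes
    }
    assigned = []
    for _ in range(iterations):
        # Assign each unlabeled vector to the most similar centroid.
        new_assigned = [
            max(classes, key=lambda c: cosine(v, centroids[c])) for v in unlabeled
        ]
        if new_assigned == assigned:
            break  # assignments are stable, so the centroids will not move
        assigned = new_assigned
        # Recompute centroids; labeled vectors always stay in their own class.
        for c in classes:
            members = [v for v, y in labeled if y == c]
            members += [v for v, a in zip(unlabeled, assigned) if a == c]
            centroids[c] = mean_direction(members)
    return labeled + list(zip(unlabeled, assigned))


# Toy example with made-up 3-dimensional term-weight vectors.
labeled = [([1.0, 0.0, 0.0], "spam"), ([0.0, 1.0, 0.0], "ham")]
unlabeled = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.0, 0.3]]
expanded = expand_training_set(labeled, unlabeled)
expanded_labels = [y for _, y in expanded]
```

In a full pipeline the expanded set would then be used to train an SVM classifier with any standard SVM library; that step is left out here to keep the sketch dependency-free.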

Authors: Zhang Man (张曼), Li Bicheng (李弼程), Lin Chen (林琛), Guo Zhigang (郭志刚)

Affiliation: Institute of Information Engineering, Information Engineering University (信息工程大学信息工程学院)

Source: Computer Engineering and Design (《计算机工程与设计》), indexed in CSCD and the Peking University Core Journal list, 2009, No. 2, pp. 385-387, 391 (4 pages)

Keywords: email classification algorithm; spherical k-means (Sk-means) algorithm; labeled data; self-adaptive selection of the optimal density radius; support vector machine (SVM)

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
