Spam Filter Approach Based on Support Vector Machine (cited by: 7)
Abstract: This paper presents a spam filtering approach based on the Support Vector Machine (SVM) for mixed Chinese and English e-mail, together with a framework that integrates multiple kinds of classification features. It modifies the representation of the SVM linear kernel to reduce storage and computation costs, and it uses automatic domain term extraction to improve the recognition of semantic units and, in turn, spam classification performance. Experiments on large-scale cross-language corpora show that the SVM-based approach outperforms a Naive Bayes model smoothed with Good-Turing by 6.13%, and improves classification accuracy over a maximum entropy model by 8.18%.
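The abstract does not spell out how the linear kernel representation is changed. As a minimal sketch of one common way to cut storage and computation for a linear SVM (an assumption here, not necessarily the authors' method), the support vectors can be folded into a single weight vector w = Σ αᵢ yᵢ xᵢ, so that scoring an e-mail costs one dot product instead of one per support vector. All names and the toy data below are hypothetical.

```python
import numpy as np

def collapse_linear_svm(support_vectors, dual_coefs):
    """Fold the support vectors into one weight vector.

    dual_coefs[i] is assumed to hold alpha_i * y_i for support vector i,
    so the result is w = sum_i (alpha_i * y_i) * x_i.
    """
    return dual_coefs @ support_vectors

def decision_with_support_vectors(x, support_vectors, dual_coefs, bias):
    # Kernel form: one dot product per support vector, O(n_sv * d).
    return float(dual_coefs @ (support_vectors @ x) + bias)

def decision_with_weight_vector(x, w, bias):
    # Collapsed form: a single dot product, O(d); only w and bias are stored.
    return float(w @ x + bias)

# Toy example with made-up numbers (not data from the paper).
rng = np.random.default_rng(0)
support_vectors = rng.random((5, 8))            # 5 support vectors, 8 features
dual_coefs = np.array([0.5, -1.0, 0.25, 0.75, -0.5])
bias = -0.1
x = rng.random(8)                               # feature vector of one e-mail

w = collapse_linear_svm(support_vectors, dual_coefs)
score_kernel_form = decision_with_support_vectors(x, support_vectors, dual_coefs, bias)
score_collapsed = decision_with_weight_vector(x, w, bias)
```

Both forms give the same score because the linear kernel is just a dot product; the collapsed form only needs to keep `w` and `bias` rather than every support vector.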
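Authors: 王祖辉 (Wang Zuhui), 姜维 (Jiang Wei)
Source: Computer Engineering (《计算机工程》), 2009, No. 13, pp. 188-189, 207 (3 pages). Indexed in CAS, CSCD, and the Peking University Core Journals list (北大核心).
Funding: National Natural Science Foundation of China (grant 70801022)
Keywords: spam filter; Support Vector Machine (SVM); domain term extraction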