Abstract
Spam filtering is the task of deciding whether an email is spam or legitimate mail (ham). Traditional approaches represent an email with the vector space model and use feature selection methods such as information gain to keep a subset of words that stand for its content. This representation carries little semantic information about the relationships among words. This paper proposes a new way to represent email features that combines the traditional representation with a term co-occurrence model, and then applies the cross-covering algorithm to train the email classifier. Experiments show that the proposed filtering method improves filtering performance over the traditional one, and that the dimensions selected through term co-occurrence are more representative than those selected by the traditional method.
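The feature representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the co-occurrence model or give details of the covering classifier, so the sketch assumes document-level pair counts as the co-occurrence measure, binary presence features, and emails given as sets of tokens. All function names are hypothetical, and the classifier step is left out.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(pos, neg):
    """Binary entropy (in bits) of a class split with pos/neg counts."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, labels, term):
    """Information gain of a term's presence for spam (1) vs. ham (0)."""
    n = len(docs)
    spam = sum(labels)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    conditional = 0.0
    for part in (with_term, without):
        if part:
            s = sum(part)
            conditional += len(part) / n * entropy(s, len(part) - s)
    return entropy(spam, n - spam) - conditional


def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of terms.

    Assumption: co-occurrence is measured at the whole-email level.
    """
    counts = Counter()
    for d in docs:
        for a, b in combinations(sorted(d), 2):
            counts[(a, b)] += 1
    return counts


def build_features(docs, labels, n_terms, n_pairs):
    """Pick the top single terms by information gain and top pairs by count."""
    vocab = sorted(set().union(*docs))
    terms = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    ranked = sorted(cooccurrence_counts(docs).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return terms[:n_terms], [pair for pair, _ in ranked[:n_pairs]]


def vectorize(doc, terms, pairs):
    """Binary vector: single-term features followed by term-pair features."""
    return ([int(t in doc) for t in terms]
            + [int(a in doc and b in doc) for a, b in pairs])
```

The vectors produced by `vectorize` would then be passed to a classifier; the paper uses the cross-covering algorithm for that step, which this sketch does not attempt to reproduce.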
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2009, Issue 6, pp. 61-66, 71 (7 pages)
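Funding
National Basic Research Program of China (973 Program) (2004CB318108, 2007CB311003)
National Natural Science Foundation of China (60675031)
Ministry of Education Social Science Research Fund, Youth Project (07JC870006)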
Keywords
computer application
Chinese information processing
vector space model
spam filtering
term co-occurrence model
cross-covering algorithm