Abstract
Processing Chinese e-mail character by character is very difficult. To address this problem, the authors introduce a Chinese word segmentation algorithm and design and implement a spam-filtering system based on word segmentation. Three key algorithms used in the implementation are described: a multiple exact/approximate string matching algorithm for keyword matching, a Chinese word segmentation algorithm for processing Chinese e-mail, and an N-gram feature extraction algorithm for feature extraction. Experiments show that the system filters both Chinese and English spam with high performance. In addition, the third part of the paper presents the design and implementation of a segmentation-based classification system for non-spam e-mail.
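The pipeline named in the abstract (segment the message into words, then extract N-gram features over those words) can be illustrated with a minimal sketch. The abstract does not say which segmentation method the paper uses, so forward maximum matching is assumed here purely for illustration; the function names `segment` and `ngrams`, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed method, not confirmed by the source):
    at each position, take the longest substring found in the lexicon."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


lexicon = {"免费", "领取", "优惠券"}  # hypothetical toy lexicon
tokens = segment("免费领取优惠券", lexicon)
bigrams = ngrams(tokens, 2)
```

Working on word tokens rather than single characters means each N-gram spans whole words, which is the motivation the abstract gives for introducing segmentation.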
Source
《武汉大学学报(理学版)》
CAS
CSCD
PKU Core Journal
2005, Issue S2, pp. 191-194 (4 pages)
Journal of Wuhan University: Natural Science Edition
Funding
National Natural Science Foundation of China (90104005, 60373089)
Hubei Province Key Science and Technology Project (2002AA101C44)
Keywords
spam
multiple exact/approximate string matching
Chinese word segmentation
N-gram feature extraction