Abstract
This paper proposes a content-based spam filtering method. It adopts a generalized suffix tree model (GSTM) and exploits the contextual information in the message content, collecting variable-length n-gram statistics at every text position. From these statistics it computes the similarity between the message under test and each training set, and assigns the message to a class accordingly. Theoretical analysis and experiments show that, on the same corpus, the method's precision and recall match or exceed those of filtering methods based on the vector space model (VSM). Filtering a message of length N takes O(N) time, and adding a new message of length N to the training set takes O(N) training time, so the training set can grow dynamically. The method requires no word segmentation, is completely independent of language, and is suitable when mail in several languages is present at the same time.
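The idea of matching variable-length substrings at every text position against per-class training data can be illustrated with a minimal sketch. This is not the paper's GSTM implementation: it swaps in a depth-bounded character suffix trie for a true generalized suffix tree, and the names `SuffixTrie`, `match_score`, `classify`, and the `max_depth` parameter are hypothetical. The score used here (average longest match per position) is likewise an assumed stand-in for the paper's similarity measure, which the abstract does not specify.

```python
class Node:
    """One trie node: how often this substring was seen, plus child links."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


class SuffixTrie:
    """Depth-bounded character suffix trie for one class (assumed simplification)."""

    def __init__(self, max_depth=6):
        self.max_depth = max_depth  # hypothetical cap on substring length
        self.root = Node()

    def add(self, text):
        """Insert every suffix of `text`, truncated to max_depth characters.

        Cost is O(N * max_depth), i.e. linear in N for a fixed cap, so new
        mail can be added to a class incrementally.
        """
        for i in range(len(text)):
            node = self.root
            for ch in text[i:i + self.max_depth]:
                node = node.children.setdefault(ch, Node())
                node.count += 1

    def match_score(self, text):
        """Average length of the longest trie match starting at each position."""
        total = 0
        for i in range(len(text)):
            node = self.root
            for ch in text[i:i + self.max_depth]:
                node = node.children.get(ch)
                if node is None:
                    break
                total += 1
        return total / max(len(text), 1)


def classify(text, spam_trie, ham_trie):
    """Assign `text` to whichever class it resembles more (assumed decision rule)."""
    if spam_trie.match_score(text) > ham_trie.match_score(text):
        return "spam"
    return "ham"


# Toy training data, invented for illustration only.
spam = SuffixTrie()
ham = SuffixTrie()
for msg in ("buy cheap pills now", "cheap pills online"):
    spam.add(msg)
for msg in ("meeting agenda for monday", "project meeting notes"):
    ham.add(msg)

print(classify("cheap pills today", spam, ham))
print(classify("monday meeting", spam, ham))
```

Because the trie is built over raw characters, no tokenizer or word segmenter is involved, which is the property that lets this style of filter handle mixed-language mail. A real suffix tree would remove the depth cap while keeping linear time.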
Source
《计算机工程》 (Computer Engineering), 2007, No. 9, pp. 100-102 (3 pages)
Indexed in: CAS, CSCD, 北大核心 (Peking University Core Journals)