Abstract
Feature selection is one of the central problems in text information filtering. This paper analyzes the shortcomings of the χ² statistic (chi-square, CHI): it is unreliable for feature terms with low document frequency, and it does not indicate the correlation between a term and a category. To address these shortcomings, a new Statistical Frequency (SF) algorithm is proposed and applied to a text information filtering system. Comparative experiments show that the SF algorithm compensates for these deficiencies and achieves good filtering performance.
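For readers unfamiliar with the CHI criterion, the following is a minimal sketch of the standard χ² statistic computed from a 2×2 term/category contingency table. It is the textbook formula, not code from the paper, and the function name `chi_square` and the counts `a`, `b`, `c`, `d` are names assumed here for illustration. The SF algorithm itself is not defined in this abstract, so it is not sketched.

```python
def chi_square(a, b, c, d):
    """Textbook chi-square score for one term and one category.

    Hypothetical argument names (not taken from the paper):
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A zero marginal makes the statistic undefined; return 0.0 by convention.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term seen in a single document still receives a sizeable score.
rare_term = chi_square(1, 0, 99, 900)

# Mirrored tables score identically: the squared difference discards the
# direction of the association between term and category.
term_in_category = chi_square(80, 20, 20, 80)
term_out_of_category = chi_square(20, 80, 80, 20)
print(rare_term, term_in_category, term_out_of_category)
```

The two comparisons plausibly correspond to the two weaknesses named in the abstract: a score built on one occurrence is hard to trust, and the score alone does not say whether a term is evidence for or against a category. That mapping is an interpretation of the abstract, not a statement from the paper.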
Source
《图书情报工作》
CSSCI
PKU Core (Peking University core journal list)
2009, No. 13, pp. 116-119 (4 pages)
Library and Information Service
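Funding
One of the research outputs of the 2006 National Social Science Fund project "Research on Network Information Filtering" (project no. 06BTQ024)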
Keywords
text filtering
text categorization
feature selection
χ² statistic (chi-square)