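Abstract
Web page categorization is one of the key technologies in web information retrieval research. This paper studies feature selection methods for categorization. Building on an analysis and comparison of the feature selection methods commonly used in text categorization, it proposes a combined feature selection method. The method integrates the existing χ² statistic method with the mutual information method; in categorization experiments on a standard text web page data set, the combined measure of recall and precision improves markedly. The selection method has been applied in "Network Compass"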
Web page categorization is one of the key technologies for web information retrieval. This paper proposes a novel feature selection method, named the Combined χ² method, which combines the χ² method with the mutual information method. Our experiments, based on real-world data collected from the Web, show that the Combined χ² method outperforms the mutual information method, the χ² method, and other existing feature selection methods based on χ² statistics. Finally, the research results in this paper have been applied in the Network Compass system, a large-scale hypertextual web search engine.
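As an illustration of the two base measures the abstract refers to, the following is a minimal sketch of the standard textbook χ² statistic and (pointwise) mutual information for one term and one category, computed from a 2x2 document-count table. The function names `chi_square` and `mutual_information` and the count parameters `a`, `b`, `c`, `d` are assumptions introduced here for illustration. The abstract does not state how the Combined χ² method merges the two scores, so no combination rule is shown.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        # The term never co-occurs with the category.
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


if __name__ == "__main__":
    # Hypothetical counts: the term appears in 40 of 50 in-category
    # documents and in 10 of 50 out-of-category documents.
    print(chi_square(40, 10, 10, 40))
    print(mutual_information(40, 10, 10, 40))
```

When the term and the category are independent (for example, all four counts equal), both scores are zero; larger values suggest a stronger association between the term and the category.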
Source
《计算机应用》
CSCD
Peking University Core Journals
2004, No. 7, pp. 119-121 (3 pages)
Journal of Computer Applications
Funding
Supported by the National Natural Science Foundation of China (90104002)
Keywords
text web page categorization
feature selection
χ² statistic
mutual information
combined χ² feature selection