Abstract
Classifying Web documents by content is a crucial step in Web mining. One common approach is hierarchical text classification, which associates classifiers with the nodes of a category tree and assigns each document to its category in a top-down manner. However, hierarchical methods suffer from "blocking": a document that is wrongly rejected by a classifier at a higher level is never passed to the classifiers at lower levels. In this paper, we use a classifier-centric performance measure, the blocking factor, to determine the extent of blocking, and we present two new hierarchical classification methods, Threshold Reduction and Restricted Voting, to reduce blocking in Web text classification.
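The blocking behaviour described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Node` class, the toy keyword scorer, the threshold values, and the reading of the blocking factor as the fraction of documents belonging under a node that the node's classifier rejects. Threshold reduction is modelled simply as subtracting a constant from every node's threshold; restricted voting is not covered.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Node:
    """A category-tree node with a hypothetical scoring classifier."""
    name: str
    score: Callable[[str], float]
    threshold: float
    children: List["Node"] = field(default_factory=list)


def keyword_score(words: Set[str]) -> Callable[[str], float]:
    """Toy scorer (assumption): fraction of tokens that are category keywords."""
    def score(doc: str) -> float:
        tokens = doc.lower().split()
        return sum(t in words for t in tokens) / len(tokens) if tokens else 0.0
    return score


def classify(doc: str, node: Node, reduction: float = 0.0) -> Set[str]:
    """Top-down classification: return the leaf categories the document reaches.

    A rejection at any node stops the descent, so the subtree below is
    never consulted (this is the blocking effect).
    """
    if node.score(doc) < node.threshold - reduction:
        return set()
    if not node.children:
        return {node.name}
    leaves: Set[str] = set()
    for child in node.children:
        leaves |= classify(doc, child, reduction)
    return leaves


def blocking_factor(node: Node, docs: List[str], reduction: float = 0.0) -> float:
    """Assumed definition: share of docs belonging under `node` that it rejects."""
    if not docs:
        return 0.0
    rejected = sum(node.score(d) < node.threshold - reduction for d in docs)
    return rejected / len(docs)


football = Node("football", keyword_score({"goal", "striker"}), 0.2)
sports = Node("sports", keyword_score({"match", "team", "goal"}), 0.3, [football])

# Both documents are assumed to belong to the "football" leaf.
docs = ["the striker scored a late goal", "team wins match with late goal"]

print(classify(docs[0], sports))                       # blocked at "sports"
print(classify(docs[0], sports, reduction=0.15))       # reaches "football"
print(blocking_factor(sports, docs))                   # before threshold reduction
print(blocking_factor(sports, docs, reduction=0.15))   # after threshold reduction
```

In this toy setup the first document is rejected by the upper-level `sports` classifier even though the lower-level `football` classifier would accept it. Lowering the thresholds lets it through, at the usual cost that documents from other categories may also pass more easily.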
Source
《计算机应用与软件》
CSCD
Peking University Core Journal (北大核心)
2007, Issue 1, pp. 58-60, 128 (4 pages in total)
Computer Applications and Software
Keywords
Data mining
Web mining
Classification