Abstract
This paper studies a method that combines hierarchical clustering with flat partitioning and applies it to Web page classification. It first describes the characteristics and complexity of the feature distribution of samples in Web page classification. Hierarchical clustering produces nested, hierarchical classes with relatively high accuracy, but its high computational complexity makes it unsuitable for large sample sets. The K-means algorithm is strongly affected by the choice of initial cluster centers and often clusters irregularly distributed samples poorly. The paper therefore applies hierarchical clustering to a small subset of the samples to determine the initial cluster centers, and then runs K-means on the whole sample set. This avoids the computational cost of hierarchical clustering while exploiting the speed of K-means, and it uses the high accuracy of hierarchical clustering as a reliable way to choose the initial centers. Experimental results on text classification are given for pure K-means and for the combined hierarchical clustering and flat partitioning method.
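The two-stage strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the linkage criterion, the distance measure, or the sample size, so centroid linkage, Euclidean distance, and the function names (`agglomerative_centers`, `kmeans`, `hybrid_cluster`) are all assumptions, and points are taken to be plain numeric feature vectors.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def agglomerative_centers(sample, k):
    """Stage 1 (assumed centroid linkage): start with one cluster per
    sample point and repeatedly merge the two clusters whose centroids
    are closest, until k clusters remain. Returns the k centroids."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(points, centers, max_iter=100):
    """Stage 2: standard K-means started from the given centers.
    Stops when assignments no longer change or after max_iter rounds."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels, centers


def hybrid_cluster(points, k, sample_size, seed=0):
    """Hierarchical clustering on a small random subset picks the initial
    centers; K-means then clusters the whole set from those centers."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    initial_centers = agglomerative_centers(sample, k)
    return kmeans(points, initial_centers)
```

A call such as `hybrid_cluster(vectors, k=5, sample_size=50)` would return one label per vector and the final centers. The quadratic pair search in stage 1 is only run on the small subset, which is where the cost saving described in the abstract comes from.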
Source
《计算机工程与应用》
CSCD
Peking University Core Journal
2004, Issue 35, pp. 139-141, 204 (4 pages in total)
Computer Engineering and Applications
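Funding
Research Project of the Education Department of Zhejiang Province (No. 20030717)
Supported by the university-level key discipline of Computer Application, Zhejiang Normal University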