Abstract
Most traditional algorithms for clustering Web logs require the user to specify the number of clusters in advance, and their results depend heavily on the initial choice of elements that represent the clusters. This paper proposes a new adaptive clustering algorithm and applies it to user sessions extracted from Web logs. The algorithm combines agglomerative and partitional clustering ideas. It uses the degree of dissimilarity between every pair of sessions in the initial data set as the distance measure, merges pairs of sessions whose distance falls below a given threshold to form the initial clusters, and then, according to certain rules, dynamically merges the closest pair of clusters (or a cluster and a single session), so that the final result is a set of natural clusters. The effectiveness of the algorithm is verified by comparing the average intra-cluster distance with the inter-cluster distances. Its main advantages are that it produces clusters automatically, without requiring the number of clusters to be specified beforehand, and that it identifies outliers effectively. Experiments show that the algorithm yields high-quality clusters.
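The abstract does not give the dissimilarity formula, the merging rules, or the threshold value, so the following is only a minimal sketch of threshold-driven agglomerative clustering in the spirit described above, not the authors' algorithm. It assumes that a session is a set of page URLs, that dissimilarity is the Jaccard distance, that the distance between clusters is the average pairwise dissimilarity, and that merging stops once the closest pair is at least `threshold` apart. The names `dissimilarity`, `cluster_distance`, and `adaptive_cluster` are hypothetical, and the two phases (initial clusters, then dynamic merging) are collapsed into one loop for brevity.

```python
from itertools import combinations


def dissimilarity(a, b):
    """Jaccard distance between two sessions (sets of page URLs). Assumed measure."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_distance(c1, c2, sessions):
    """Average pairwise dissimilarity between two clusters. Assumed linkage rule."""
    total = sum(dissimilarity(sessions[i], sessions[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def adaptive_cluster(sessions, threshold):
    """Repeatedly merge the closest pair of clusters until none is closer than threshold.

    Returns a list of clusters, each a list of session indices. The number of
    clusters is never supplied; it falls out of the threshold.
    """
    clusters = [[i] for i in range(len(sessions))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance (a < b always holds).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], sessions),
        )
        if cluster_distance(clusters[a], clusters[b], sessions) >= threshold:
            break  # no remaining pair is close enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: b > a, so index a is unaffected
    return clusters


# Toy data (made up for illustration): two groups of similar sessions and one stray.
sessions = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},
    {"/x", "/y"},
    {"/x", "/y", "/z"},
    {"/q"},
]
clusters = adaptive_cluster(sessions, threshold=0.5)
# Sessions left alone in a cluster are treated here as outliers (an assumption).
outliers = [c[0] for c in clusters if len(c) == 1]
print(clusters, outliers)
```

On this toy input the two groups of overlapping sessions are merged and the session that shares no page with any other remains a singleton, which is how the sketch flags an outlier. Comparing `cluster_distance` within and between the resulting clusters would be one way to reproduce the kind of intra-cluster versus inter-cluster check the abstract mentions.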
Source
《现代电子技术》
2007, Issue 24, pp. 139-142 (4 pages)
Modern Electronics Technique
Keywords
degree of dissimilarity (相异度)
agglomerative clustering algorithm (凝聚聚类算法)
adaptive clustering algorithm (自适应聚类算法)
user session (用户会话)