Abstract
This paper improves the HTFC algorithm, which is based on mixed similarity. Preprocessing consists of building a vector space model and computing a mixed similarity from document text and hyperlinks. The algorithm proceeds as follows: first, √(kn) documents are selected at random and clustered hierarchically (agglomeratively) until k clusters remain; next, these k clusters are refined iteratively until cluster membership no longer changes; then a representation of each class is produced; finally, user feedback on the result drives further iterations over the newly generated clusters until the user's needs are met. The first step uses an improved k-means algorithm, which improves running efficiency. The feedback mechanism further corrects the original model, thereby improving precision.
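The clustering steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the mixed-similarity formula, so the sketch assumes a weighted sum of cosine similarities over a text vector and a link vector, with a hypothetical weight `alpha`. It also assumes n is the number of documents and k the target number of clusters, uses centroid-based merging for the hierarchical step, and leaves out the user-feedback stage. All function names are made up for the example.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mixed_similarity(d1, d2, alpha=0.5):
    """ASSUMED form: weighted sum of text and link cosine similarities.
    Each document is a pair (text_vector, link_vector)."""
    return alpha * cosine(d1[0], d2[0]) + (1 - alpha) * cosine(d1[1], d2[1])

def centroid(docs):
    """Component-wise mean of the text vectors and of the link vectors."""
    n = len(docs)
    text = [sum(d[0][i] for d in docs) / n for i in range(len(docs[0][0]))]
    link = [sum(d[1][i] for d in docs) / n for i in range(len(docs[0][1]))]
    return (text, link)

def agglomerate(sample, k, alpha):
    """Merge the two most similar clusters (by centroid) until k remain."""
    clusters = [[d] for d in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = mixed_similarity(centroid(clusters[i]),
                                     centroid(clusters[j]), alpha)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return [centroid(c) for c in clusters]

def cluster_documents(docs, k, alpha=0.5, seed=0, max_iter=100):
    """Seed k centers from a random sample of about sqrt(k*n) documents,
    then reassign all documents until the assignment stops changing."""
    rng = random.Random(seed)
    m = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    centers = agglomerate(rng.sample(docs, m), k, alpha)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centers)),
                key=lambda c: mixed_similarity(d, centers[c], alpha))
            for d in docs
        ]
        if new_labels == labels:  # membership is stable: stop iterating
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels
```

Running the costly pairwise merging only on the sample, rather than on all n documents, is what keeps the seeding step cheap; the full collection is touched only in the linear reassignment passes.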
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2005, No. 10, pp. 2685-2687 (3 pages)
Computer Engineering and Design
Funding
Scientific Research Fund of the Shanghai Municipal Education Commission (04EB12)
Keywords
text clustering algorithm
information retrieval
web mining