摘要
设计了一种基于主题的Web文本聚类方法(HTBC):首先根据文本的标题和正文提取文本的主题词向量,然后通过训练文本集生成词聚类,并将每个主题词向量归类到其应属的词类,再将同属于一个词类的主题词向量对应的文本归并到用对应词类的名字代表的类,从而达到聚类的目的.算法分四个步骤:预处理、建立主题向量、生成词聚类和主题聚类.同时,对HTBC与STC、AHC、KMC算法从聚类的准确率和召回率上做了比较,实验结果表明,HTBC算法的准确率较STC、AHC和KMC算法要好.
A clustering method-HTBC was devised based on theme.It extracts the Keywords according to the title and the main body of the document,trains the text sets to generate the word clustering,classifies each keyword to responding word cluster,combines the same thesis attribute to word cluster and finally realizes clustering.There are four steps for HTBC such as pretreatment,constructing the theme vector,generating the word cluster and theme clustering.The experimental data indicate HTBC are better than K-Means,AHC and STC in terms of accuracy and recall ratio after comparision.
出处
《成都大学学报(自然科学版)》
2010年第3期249-252,共4页
Journal of Chengdu University(Natural Science Edition)