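Abstract
To address the heavy computational cost of text clustering, a clustering method is proposed that combines concept lattice theory with the Newman fast algorithm. First, each text is represented as a set of feature words, and feature vectors are extracted by statistical methods; at the same time, word weights are computed with the IDF weighting formula and then discretized. Next, the keywords are expressed as a formal context, and the similarity between formal concepts is computed with a similarity formula. Finally, a Newman network is constructed, and the texts to be clustered are grouped according to the rules of the Newman network algorithm. A worked example shows that the algorithm not only produces the correct classification but also greatly reduces computational complexity: the Newman fast algorithm runs in only O((m+n)n).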
Because text clustering is computationally expensive, a new text clustering method is presented that takes advantage of both the concept lattice and the Newman fast algorithm. First, the algorithm represents each text as a set of feature words and extracts feature vectors by a statistical method. Second, it computes word weights with the TFIDF weight formula and discretizes them. Third, it expresses the keywords as a formal context and calculates formal concept similarity with a similarity formula. Fourth, it builds a Newman network and clusters the texts according to the rules of the Newman network algorithm. Finally, an experiment shows the validity of the method: it not only yields the correct classification results but also greatly reduces the complexity of the algorithm, with the Newman fast algorithm requiring only O((m+n)n).
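The pipeline summarized in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: weights are discretized by a single cut-off (`threshold`), the unspecified similarity formula is replaced by Jaccard similarity between each document's attribute set in the formal context (rather than between formal concepts of a lattice), and two documents are linked in the network when their similarity reaches `min_sim`. The function names and the toy corpus are hypothetical. The clustering step is Newman's greedy modularity agglomeration, stopping when no merge increases modularity.

```python
import math
from itertools import combinations

def idf_weights(docs):
    """IDF weight log(N / df) per term; docs is a list of token lists."""
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {t: math.log(len(docs) / c) for t, c in df.items()}

def formal_context(docs, weights, threshold):
    """Assumed discretization: a document has a term iff its weight > threshold."""
    return [frozenset(t for t in doc if weights[t] > threshold) for doc in docs]

def jaccard(a, b):
    """Stand-in similarity (assumption): shared attributes over all attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_edges(context, min_sim):
    """Link two documents when their similarity reaches min_sim (assumption)."""
    return [(i, j) for i, j in combinations(range(len(context)), 2)
            if jaccard(context[i], context[j]) >= min_sim]

def newman_fast(n, edges):
    """Greedy modularity agglomeration on an unweighted, undirected graph."""
    m = len(edges)
    comms = {i: {i} for i in range(n)}
    if m == 0:
        return list(comms.values())
    e = {}                           # e[(i, j)]: half the edge fraction between i and j
    a = {i: 0.0 for i in range(n)}   # a[i]: fraction of edge ends attached to i
    for u, v in edges:
        e[(u, v)] = e[(v, u)] = 1.0 / (2 * m)
        a[u] += 1.0 / (2 * m)
        a[v] += 1.0 / (2 * m)
    while True:
        best, best_dq = None, 0.0
        for (i, j), eij in e.items():
            if i < j:
                dq = 2.0 * (eij - a[i] * a[j])   # modularity gain of merging i and j
                if dq > best_dq:
                    best, best_dq = (i, j), dq
        if best is None:             # no merge improves modularity
            break
        i, j = best
        comms[i] |= comms.pop(j)     # merge community j into community i
        merged = {}
        for (p, q), val in e.items():
            p, q = (i if p == j else p), (i if q == j else q)
            if p != q:               # internal edges no longer link two communities
                merged[(p, q)] = merged.get((p, q), 0.0) + val
        e = merged
        a[i] += a.pop(j)
    return list(comms.values())

# Toy corpus (hypothetical): three texts on lattices, three on networks.
docs = [
    ["lattice", "concept", "formal"],
    ["lattice", "concept", "context"],
    ["concept", "formal", "context"],
    ["network", "modularity", "community"],
    ["network", "modularity", "graph"],
    ["network", "community", "graph"],
]
weights = idf_weights(docs)
context = formal_context(docs, weights, threshold=0.5)
edges = build_edges(context, min_sim=0.3)
clusters = sorted(sorted(c) for c in newman_fast(len(docs), edges))
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

In Newman's formulation, n is the number of nodes and m the number of edges of the network, which is how the O((m+n)n) bound quoted above is usually stated. This sketch rescans all inter-community links on every merge for clarity and makes no attempt to match the data structures of an optimized implementation.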
Source
《科学技术与工程》
2010, No. 30, pp. 7550-7553 (4 pages)
Science Technology and Engineering