
Discrimination information method based on consensus and classification for improving document clustering (cited by: 6)
Abstract: Different clustering algorithms follow different strategies, yet each technique has limitations on particular datasets. An appropriate choice of Discrimination Information Method (DIM) helps ensure that document clustering succeeds. To address these problems, a DIM for Document Clustering based on Consensus and Classification (DCCC) is proposed. First, Clustering by Discrimination Information Maximization (CDIM) is used to generate the initial clustering of the dataset, and two different CDIM methods produce two initial cluster sets. Second, the two initial cluster sets are re-initialized with different parameter methods, and a consensus is established from the relationship between their cluster label information so as to maximize the sum of the documents' discrimination numbers. Finally, Discrimination Text Weight Classification (DTWC) is chosen as the text classifier to assign new cluster labels to the consensus; the base partitions are altered by training the text classifier, and the final partition is generated from the predicted label information. Experiments were carried out on 8 network datasets, with BCubed precision and recall as the clustering validation metrics. The results show that the proposed consensus and classification method produces better clustering results than the comparison methods.
Authors: WANG Liuyang; YU Yangxin; CHEN Bolun; ZHANG Hui (Faculty of Computer & Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2020, No. 4, pp. 1069-1073 (5 pages)
Funding: National Natural Science Foundation of China (61602202).
Keywords: consensus clustering; document clustering; discrimination information; cluster label; text classifier
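The abstract names BCubed precision and recall as the validation metrics but does not spell them out. The sketch below follows the standard BCubed definition (per-document scores averaged over all documents) and is not taken from the paper; the function name `bcubed_precision_recall` and the toy labels are hypothetical.

```python
from collections import Counter


def bcubed_precision_recall(cluster_labels, class_labels):
    """Return (precision, recall) under the standard BCubed definition.

    For each document, precision is the fraction of documents in its
    cluster that share its gold class, and recall is the fraction of
    documents in its gold class that share its cluster. The document
    itself is counted in both. Scores are averaged over all documents.
    """
    if len(cluster_labels) != len(class_labels):
        raise ValueError("label sequences must have the same length")
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    cluster_sizes = Counter(cluster_labels)
    class_sizes = Counter(class_labels)
    # Number of documents sharing both a cluster and a gold class.
    overlap = Counter(zip(cluster_labels, class_labels))

    pairs = list(zip(cluster_labels, class_labels))
    precision = sum(overlap[(c, g)] / cluster_sizes[c] for c, g in pairs) / n
    recall = sum(overlap[(c, g)] / class_sizes[g] for c, g in pairs) / n
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy data: five documents, two clusters, two gold classes.
    clusters = [0, 0, 0, 1, 1]
    classes = ["a", "a", "b", "b", "b"]
    p, r = bcubed_precision_recall(clusters, classes)
    print(f"BCubed precision={p:.4f} recall={r:.4f}")
```

Because the scores are computed per document, a large impure cluster lowers the score of every document it contains.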
