Abstract
Document clustering is the process of automatically grouping a document collection into several categories, and it is an effective way to organize textual information. To address the high dimensionality that arises when semi-structured text data is converted into structured data, this paper proposes CASC, a document clustering model based on a convolutional autoencoder. CASC uses the feature-extraction capabilities of convolutional neural networks and autoencoders to embed the data in a low-dimensional latent space while preserving as much of the original data's internal structure as possible, and then clusters the embedded data with a spectral clustering algorithm. Experiments show that CASC reduces the algorithm's running time, and also lowers its time complexity, without reducing clustering accuracy.
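The second stage of the pipeline described in the abstract, spectral clustering applied to low-dimensional latent codes, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' CASC implementation. The latent codes `Z` are synthetic stand-ins for the output of a trained convolutional autoencoder. The function names (`spectral_clustering`, `kmeans`, `_nearest`), the Gaussian affinity, and the bandwidth `sigma` are all assumptions made for the example.

```python
import numpy as np


def _nearest(X, centers):
    # Index of the closest center for every row of X (squared Euclidean distance).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)


def kmeans(X, k, n_iter=50):
    # Deterministic farthest-point initialisation, then standard Lloyd updates.
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = _nearest(X, centers)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return _nearest(X, centers)


def spectral_clustering(Z, k, sigma=1.0):
    # Gaussian (RBF) affinity between latent codes; sigma is an assumed bandwidth.
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-12)
    L = np.eye(len(Z)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvectors of the k smallest eigenvalues (eigh sorts them ascending).
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)

    # Cluster the row-normalised spectral embedding with k-means.
    return kmeans(U, k)


# Hypothetical 8-dimensional latent codes for 40 documents drawn from two groups.
rng = np.random.default_rng(0)
Z = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 8)),
    rng.normal(3.0, 0.3, size=(20, 8)),
])
labels = spectral_clustering(Z, k=2, sigma=2.0)
```

Working on the latent codes rather than on raw high-dimensional document vectors keeps each pairwise distance cheap to compute, which fits the motivation given in the abstract. The eigendecomposition still scales with the number of documents, so this sketch says nothing about the running times reported in the paper.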
Authors
FENG Yongqiang (冯永强); LI Yajun (李亚军)
Tianjin Haihe Dairy Company, Tianjin 300410, China; College of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300457, China
Source
Modern Information Technology (《现代信息科技》)
2018, No. 2, pp. 12-15 (4 pages)
Funding
Tianjin Science and Technology Plan Project (17KPXMSF00140, 17ZLZXZF00470)
Tianjin Science and Technology Project (KJCX-KFQ-CXY-2016-003)
Keywords
clustering
convolutional neural network
autoencoder
unsupervised model