Abstract
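Most existing deep text clustering methods depend too heavily on the quality of the raw data during feature mapping and lose key semantic information along the way. To address these problems, this paper proposes a deep text clustering algorithm based on key semantic information complementation (DCKSC). The algorithm first augments the original text data with extracted keyword data. Second, it designs a key semantic information complementation module that improves the traditional autoencoder by restoring the key semantic information lost during mapping. Finally, it learns representations suited to clustering by combining the clustering loss with the reconstruction loss of the keyword semantic autoencoder. Experiments show that the proposed algorithm outperforms current state-of-the-art clustering methods on five real-world datasets. The clustering results demonstrate the importance of key semantic information complementation and text data augmentation for deep text clustering.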
Most existing deep text clustering methods use only a traditional autoencoder to learn representations for clustering, and neglect the problems of over-reliance on raw data quality and loss of key semantic information during feature mapping. This paper proposes a deep document clustering method via key semantic information complementation (DCKSC). The DCKSC model first enriches the original text data by extracting keyword data. Second, the model designs a key semantic information complementation module that uses the data-augmented representation to improve the traditional autoencoder and compensate for the key semantic information lost in the mapping process. Finally, the algorithm combines the clustering loss and the reconstruction loss of the keyword semantic autoencoder, optimizes the cluster label assignment, and learns representation features suitable for clustering. Experimental results show that DCKSC is superior to many mainstream deep document clustering algorithms.
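As a rough illustration of the final step, the sketch below combines a reconstruction loss and a clustering loss into one objective. The abstract does not define either loss, so everything here is an assumption. The clustering term borrows the Student's t soft assignment and KL divergence from DEC, a common choice in deep clustering that is not confirmed as the DCKSC formulation. The reconstruction term is mean squared error. The names `soft_assign`, `target_distribution`, `joint_loss`, and the weight `gamma` are hypothetical.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment of embeddings to cluster centers (DEC-style, assumed)."""
    # Squared Euclidean distance between every embedding and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so it is a distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target distribution, as used for self-supervision in DEC (assumed)."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def joint_loss(x, x_rec, z, centers, gamma=0.1):
    """Reconstruction loss plus gamma-weighted clustering loss (weighting is an assumption)."""
    rec = np.mean((x - x_rec) ** 2)           # autoencoder reconstruction error (MSE)
    q = soft_assign(z, centers)
    p = target_distribution(q)
    clu = np.sum(p * np.log(p / q)) / len(z)  # KL(P || Q), averaged over samples
    return rec + gamma * clu, rec, clu

# Toy data standing in for text features, their reconstructions, and embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 20))
x_rec = x + 0.1 * rng.normal(size=x.shape)
z = rng.normal(size=(8, 4))
centers = rng.normal(size=(3, 4))

total, rec, clu = joint_loss(x, x_rec, z, centers)
print(f"total={total:.4f} rec={rec:.4f} clu={clu:.4f}")
```

In a real model the two terms would be minimized jointly by gradient descent over the encoder, decoder, and cluster centers. The keyword extraction and the key semantic information complementation module are not modeled here.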
Authors
郑璐依
黄瑞章
任丽娜
白瑞娜
林川
Zheng Luyi; Huang Ruizhang; Ren Lina; Bai Ruina; Lin Chuan (State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China; College of Computer Science & Technology, Guizhou University, Guiyang 550025, China)
Source
《计算机应用研究》
CSCD
Peking University Core Journal
2023, Issue 6, pp. 1653-1659 (7 pages)
Application Research of Computers
Funding
Supported by the National Natural Science Foundation of China (62066007).
Keywords
deep text clustering
representation learning
autoencoder
self-supervised clustering
data augmentation