Abstract
In recent years, incorporating structural information into deep document clustering has produced strong results. However, most methods for constructing structural information rely on simple distance computations with a fixed number of neighbors, so the resulting graph struggles to capture accurate text structure. In addition, many methods mine only first-order neighbors, which leaves the graph structure underexploited and limits the performance of structure-aware deep document clustering. To address this, the study proposes DCMBS, a deep document clustering model based on adaptive structural learning. First, a threshold-based graph construction method dynamically adjusts the number of neighboring texts, addressing the inaccurate structural information caused by a fixed neighbor count. Second, a topological neighbor exploration method mines neighboring texts at multiple orders, addressing the incomplete structural information that results from first-order mining alone. In addition, a threshold decay strategy is designed to prevent learning from becoming overly generalized as the topological order increases. Experimental results on four real-world datasets show that, compared with existing strong clustering models, DCMBS improves accuracy, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) by an average of 6.83, 2.93, and 6.23 percentage points, respectively.
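The three ideas named in the abstract (threshold-based graph construction, multi-order neighbor exploration, and threshold decay) can be illustrated with a minimal sketch. This is not the DCMBS implementation: the abstract gives no formulas, so the Euclidean distance, the multiplicative decay rule, the direction of the decay (stricter at higher orders), and the names `threshold_graph`, `multi_order_neighbors`, `threshold`, `decay`, and `max_order` are all assumptions made for illustration.

```python
import math


def threshold_graph(points, threshold):
    """Link two points when their distance is at most `threshold`.

    Unlike a fixed-k nearest-neighbor graph, the number of neighbors
    per point varies with local density. (Distance metric is assumed.)
    """
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def multi_order_neighbors(points, threshold, decay, max_order):
    """Collect neighbors up to `max_order` hops away.

    Assumed decay rule: at each additional order the allowed hop
    distance is multiplied by `decay` (< 1), so distant hops must be
    tighter and far-away points are less likely to be pulled in.
    """
    first = threshold_graph(points, threshold)
    found = {i: set(nbrs) for i, nbrs in first.items()}
    frontier = {i: set(nbrs) for i, nbrs in first.items()}
    hop_limit = threshold
    for _ in range(2, max_order + 1):
        hop_limit *= decay  # stricter limit for each higher order
        next_frontier = {}
        for i in found:
            reached = set()
            for j in frontier[i]:
                for k in first[j]:
                    if k == i or k in found[i]:
                        continue
                    if math.dist(points[j], points[k]) <= hop_limit:
                        reached.add(k)
            next_frontier[i] = reached
        for i, reached in next_frontier.items():
            found[i] |= reached
        frontier = next_frontier
    return found


# Hypothetical 1-D "embeddings" used only to exercise the functions.
points = [(0.0,), (0.9,), (1.5,), (2.4,), (10.0,)]
neighbors = multi_order_neighbors(points, threshold=1.0, decay=0.7, max_order=2)
```

In this sketch the higher-order neighbor sets are not necessarily symmetric, because each hop is checked from the exploring point outward. A real model would operate on learned text embeddings and would feed the resulting graph to a graph neural network, which is outside the scope of this example.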
Authors
PAN Wei
HUANG Ruizhang
REN Lina
XUE Jingjing
Affiliations: Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, Guizhou University, Guiyang 550025, Guizhou, China; State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, Guizhou, China; College of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
Source
Computer Engineering
CAS
CSCD
Peking University Core Journal
2024, No. 11, pp. 89-97 (9 pages)
Funding
National Natural Science Foundation of China (62166007)
Natural Science Foundation of Guizhou Province (黔科合基础ZK[2022]027)
Keywords
threshold
deep document clustering
text structure information
Graph Neural Network (GNN)
adaptive structural learning