Abstract
Traditional shallow text clustering methods face several challenges when clustering short texts, such as limited contextual information, irregular word use, and few semantically meaningful words, which lead to sparse text embeddings and difficulty in extracting key features. To address these issues, this paper proposes SSKU (SBERT SimCSE K-Means UMAP), a deep clustering model that incorporates a simple data augmentation method. The model uses SBERT to embed short texts and fine-tunes the embedding model with the unsupervised SimCSE method jointly with the deep-clustering K-Means algorithm, improving the short-text embeddings so that they are better suited to clustering. The UMAP manifold dimensionality-reduction method then learns the local manifold structure of the embeddings to alleviate the sparse-feature problem of short texts and to optimize the embedding results. Finally, the K-Means algorithm clusters the reduced embeddings to obtain the clustering result. Extensive experiments on four public short-text datasets, including StackOverFlow and Biomedical, compare the proposed model with recent deep clustering algorithms. The results show that it achieves good clustering performance on both evaluation metrics: accuracy (ACC) and normalized mutual information (NMI).
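The pipeline described in the abstract (embed, fine-tune, reduce, cluster) can be sketched at the inference stage. The sketch below is illustrative only: real SSKU embeddings would come from an SBERT encoder fine-tuned with unsupervised SimCSE, and the paper uses UMAP for reduction; here synthetic embeddings, a PCA stand-in for UMAP, and a plain NumPy K-Means keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for SBERT/SimCSE sentence embeddings: three synthetic,
# well-separated clusters in a 768-dimensional space.
centers = rng.normal(size=(3, 768)) * 5.0
X = np.vstack([c + rng.normal(size=(50, 768)) for c in centers])

def reduce_dim(X, k=5):
    """PCA via SVD -- a linear stand-in for the UMAP manifold
    reduction used in the paper."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's K-Means on the reduced embeddings."""
    rng = np.random.default_rng(seed)
    cent = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else cent[j] for j in range(k)])
        if np.allclose(new, cent):
            break
        cent = new
    return labels

Z = reduce_dim(X, k=5)          # reduce 768-dim embeddings to 5 dims
labels = kmeans(Z, k=3)         # cluster the reduced embeddings
```

With clusters this well separated, the reduced representation preserves the grouping and K-Means recovers the three synthetic clusters; on real short texts the quality of this final step is exactly what the SimCSE fine-tuning and UMAP stages are meant to improve.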
Authors
HE Wenhao, WU Chunjiang, ZHOU Shijie, HE Chaoxin
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
Source
《计算机科学》 (Computer Science)
Indexed in CSCD; Peking University Core Journal (北大核心)
2023, No. 11, pp. 71-76 (6 pages)
Keywords
Short text
Deep clustering
Pre-training model
Dimension reduction
Natural language processing