Abstract
To address the loss of semantic information when features are extracted with the latent Dirichlet allocation (LDA) model, a disease text clustering algorithm, LG&K-Medoide, is proposed that combines LDA with the GloVe (Global Vectors for Word Representation) model. First, LDA is used to model the disease text data, and the Jensen-Shannon (JS) distance is used to compute text similarity. Second, GloVe is used to model the disease text data and obtain word vectors; the word vectors are weighted according to the contribution of each part of speech in the disease text, and the cosine distance is used to compute a weighted, GloVe-based text similarity. Finally, the two similarities are combined into an improved distance formula, which is used for K-Medoide clustering. Experimental results show that the LG&K-Medoide algorithm achieves higher accuracy than clustering algorithms based on the LDA, LDA+TF-IDF, and LDA+Word2Vec models.
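The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the improved distance formula, so the linear mix with a hypothetical weight `alpha` is an assumption, as are the function names, the toy topic distributions and document vectors, and the choice of the JS distance as the square root of the base-2 JS divergence. The clustering step uses a simple alternating K-medoids update, which may differ from the variant used in the paper.

```python
import numpy as np

def js_distance(p, q):
    """JS distance, taken here as sqrt of the base-2 JS divergence (assumption)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))

def weighted_doc_vector(word_vectors, weights):
    """Weighted average of word vectors; weights stand in for part-of-speech contributions."""
    w = np.asarray(weights, dtype=float)
    v = np.asarray(word_vectors, dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()

def cosine_distance(u, v):
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_distance_matrix(thetas, doc_vecs, alpha=0.5):
    """Hypothetical combination: alpha * JS distance + (1 - alpha) * cosine distance."""
    n = len(thetas)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = alpha * js_distance(thetas[i], thetas[j]) \
                + (1.0 - alpha) * cosine_distance(doc_vecs[i], doc_vecs[j])
            dist[i, j] = dist[j, i] = d
    return dist

def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating K-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                # New medoid: the member with the smallest total distance to its cluster
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy data (made up): per-document topic distributions and weighted word vectors
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
doc_vecs = [
    weighted_doc_vector([[1.0, 0.0], [0.8, 0.2]], [2.0, 1.0]),
    weighted_doc_vector([[0.9, 0.1], [0.7, 0.3]], [1.0, 1.0]),
    weighted_doc_vector([[0.0, 1.0], [0.2, 0.8]], [2.0, 1.0]),
    weighted_doc_vector([[0.1, 0.9], [0.3, 0.7]], [1.0, 1.0]),
]
D = combined_distance_matrix(thetas, doc_vecs, alpha=0.5)
medoids, labels = k_medoids(D, k=2)
```

In a real setting, the topic distributions would come from a fitted LDA model and the word vectors from trained GloVe embeddings; the value of `alpha` would need to be tuned, since the abstract does not state how the two similarities are balanced.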
Authors
WU Di (吴迪); ZHAO Yufeng (赵玉凤)
School of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei 056038, China
Source
Journal of Hebei University of Engineering: Natural Science Edition (《河北工程大学学报(自然科学版)》)
CAS
2022, No. 1, pp. 92-98 (7 pages)
Funding
Supported by the Natural Science Foundation of Hebei Province (F2020402003, F2019402428).