Abstract
To handle the very large volume of ultra-short, user-generated texts on the Internet that involve electric power complaints, this paper proposes a clustering model based on deep embedding for identifying the topics of Internet power complaint texts. First, word embedding is performed with an improved algorithm to enrich the semantics of the text features and reduce the dimensionality of the dataset. Then, building on the word embeddings, Sentence-BERT is used to compute sentence similarity, which enables clustering of the short texts. Finally, the proposed method is applied to a dataset of power-complaint texts drawn from Internet user messages crawled by the authors, completing topic clustering of the complaint texts. Its results are compared with those of several existing topic recognition algorithms on the same dataset, demonstrating the effectiveness of the proposed model.
Source
Computer Science and Application (《计算机科学与应用》)
2023, No. 4, pp. 853-864 (12 pages)