Abstract
This paper aims to quickly identify users' hot topics in offensive comment texts. It addresses two problems. First, traditional topic models applied to comment texts describe semantics insufficiently, lose contextual information, and produce weakly coherent topics. Second, the K-means clustering algorithm is sensitive to the value of K and to the initial centroids. The CoSENT (cosine sentence) model is used to obtain sentence-level vector features of comment texts containing offensive language. The resulting vector matrix is reduced in dimensionality with the UMAP (uniform manifold approximation and projection) algorithm, and an improved K-means algorithm based on Canopy+ partitions the reduced vectors into clusters. c-TF-IDF (class term frequency-inverse document frequency) then identifies the topic features of each topic cluster for topic modeling. The effectiveness of the method is verified through comparative experiments on an offensive comment text dataset and an ordinary comment dataset. The results show that the proposed method achieves better topic consistency.
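The clustering and topic-labeling stages described above can be illustrated with a minimal, standard-library-only sketch. Several things here are assumptions rather than the authors' implementation: the abstract does not specify the Canopy+ variant, so the code uses the classic Canopy pass (only its tight threshold, here called `t2`) to choose K and the initial centers. The CoSENT and UMAP stages are replaced by hand-written 2-D points. The c-TF-IDF weight is taken in the commonly used form tf(t, c) · log(1 + A / f_t), where A is the average number of words per cluster and f_t is the frequency of term t across all clusters. All function names, the toy data, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def canopy_centers(points, t2):
    """Classic Canopy pass (not the paper's Canopy+): take the first remaining
    point as a center, then drop every point within distance t2 of it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans(points, centers, iters=20):
    """Lloyd's K-means started from the given centers; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def c_tf_idf(docs, labels):
    """Per-cluster term weights: tf(t, c) * log(1 + A / f_t) (assumed form)."""
    per_class = {}
    for doc, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(doc.split())
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_class)  # A: mean words per cluster
    return {
        label: {t: (n / sum(counts.values())) * math.log(1 + avg_words / total[t])
                for t, n in counts.items()}
        for label, counts in per_class.items()
    }


# Toy stand-ins for UMAP-reduced sentence vectors and their source comments.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
docs = ["refund refund late", "late refund delivery",
        "rude staff rude", "staff rude insult"]

centers = canopy_centers(points, t2=1.0)   # K is implied by the number of canopies
labels, centers = kmeans(points, centers)
weights = c_tf_idf(docs, labels)
for label, terms in sorted(weights.items()):
    print(label, max(terms, key=terms.get))
```

Seeding K-means from canopy centers means neither K nor the starting centroids has to be guessed, which is the sensitivity the abstract refers to. The threshold `t2` becomes the parameter to tune instead.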
Authors
陈健飞
卜凡亮
王一帆
CHEN Jian-fei; BU Fan-liang; WANG Yi-fan (School of Information Network Security, People's Public Security University of China, Beijing 100038, China)
Source
《科学技术与工程》
Peking University Core Journal (北大核心)
2024, No. 31, pp. 13442-13449 (8 pages)
Science Technology and Engineering
Funding
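"Double First-Class" Special Project in Security and Protection Engineering, People's Public Security University of China (2023SYL08).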