Abstract
To address the incomplete contextual semantic information and the large number of interfering words in traditional document representation methods in natural language processing, a document representation method based on LDA-TF-IDF and Word2vec is proposed. First, the data set is preprocessed by word segmentation and stop-word removal. Second, the LDA topic model and TF-IDF are used to extract representative feature words from each document and to calculate their weights. Finally, a Word2vec model is trained on the data set to obtain word vectors, and the weights of the extracted feature words are integrated into the Word2vec word vectors to construct document semantic vectors. The method is verified on a classification task. The experimental results show that it performs better than existing methods on a spam SMS data set, which verifies its effectiveness.
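The weighting-and-fusion step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the LDA and TF-IDF scores are combined, so multiplying them, keeping the top-k words, and taking a normalized weighted average are all assumptions made here. The `topic_word_prob` table stands in for the output of a trained LDA model, `word_vectors` stands in for trained Word2vec vectors, and the toy documents and every number are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, one common variant; the paper's exact formula is not given.
        weights.append({
            w: (c / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return weights

def document_vector(doc_weights, topic_word_prob, word_vectors, top_k=3):
    """Weighted average of word vectors over the top_k feature words."""
    # ASSUMPTION: the combined weight is tf-idf times a topic-word probability.
    scores = {w: t * topic_word_prob.get(w, 0.0)
              for w, t in doc_weights.items() if w in word_vectors}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    dim = len(next(iter(word_vectors.values())))
    total = sum(s for _, s in top)
    vec = [0.0] * dim
    if total == 0:
        return vec
    for w, s in top:
        for i, x in enumerate(word_vectors[w]):
            vec[i] += (s / total) * x
    return vec

# Hypothetical, already segmented and stop-word-filtered toy messages.
docs = [["win", "free", "prize", "now"],
        ["meeting", "at", "noon", "today"],
        ["free", "prize", "call", "now"]]

# Hypothetical stand-in for LDA topic-word probabilities.
topic_word_prob = {"free": 0.30, "prize": 0.25, "win": 0.20, "call": 0.10,
                   "meeting": 0.30, "noon": 0.20, "today": 0.10,
                   "now": 0.05, "at": 0.01}

# Hypothetical stand-in for trained Word2vec vectors (2 dimensions).
word_vectors = {"free": [0.9, 0.1], "win": [0.8, 0.2], "prize": [0.85, 0.15],
                "call": [0.7, 0.3], "now": [0.5, 0.5], "at": [0.5, 0.5],
                "meeting": [0.1, 0.9], "noon": [0.2, 0.8], "today": [0.3, 0.7]}

vectors = [document_vector(w, topic_word_prob, word_vectors)
           for w in tfidf_weights(docs)]
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

In a full pipeline the resulting document vectors would be passed to a classifier, as in the spam SMS experiment; restricting the average to the highest-weighted words is one plausible way to suppress the interfering words the abstract mentions.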
Authors
PENG Junli (彭俊利)
WANG Shaoxuan (王少泫)
LU Zhengqiu (陆正球)
LI Xingyuan (李兴远)
Zhejiang Fashion Institute of Technology, Ningbo, Zhejiang 315211, China
Source
Journal of Zhejiang Fashion Institute of Technology (《浙江纺织服装职业技术学院学报》)
2023, No. 2, pp. 91-96 (6 pages)
Funding
Zhejiang Province Visiting Engineer Project (No. FG2021133)
Research Projects of Zhejiang Fashion Institute of Technology (Nos. 2022-2B-013, 2022-2B-005, 2021-2B-008).