Document Representation Based on LDA-TF-IDF and Word2vec (cited by: 1)
Abstract: To address the incomplete contextual semantic information and the large number of noise words in traditional document representation methods in natural language processing, this paper proposes a document representation method based on LDA-TF-IDF and Word2vec. First, the dataset is preprocessed with word segmentation and stop-word removal. Second, the LDA topic model and TF-IDF are used to extract representative feature words from each document and to compute their weights. Finally, a Word2vec model is trained on the dataset to obtain word vectors, and the weights of the extracted feature words are combined with the Word2vec word vectors to build a semantic vector for each document. The method is evaluated on a classification task. The experimental results show that it outperforms existing methods on a spam SMS dataset, which supports its effectiveness.
Authors: PENG Junli; WANG Shaoxuan; LU Zhengqiu; LI Xingyuan (Zhejiang Fashion Institute of Technology, Ningbo, Zhejiang 315211, China)
Source: Journal of Zhejiang Fashion Institute of Technology, 2023, No. 2, pp. 91-96 (6 pages)
Funding: Zhejiang Province Visiting Engineer Project (No. FG2021133); research projects of Zhejiang Fashion Institute of Technology (Nos. 2022-2B-013, 2022-2B-005, 2021-2B-008).
Keywords: LDA topic model; TF-IDF; word2vec; document representation
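
The pipeline summarized in the abstract (extract weighted feature words, then fold the weights into word vectors) can be illustrated with a minimal sketch. The abstract does not state the exact weighting formula, so everything specific here is an assumption: the TF-IDF variant (`tf * (log(N/df) + 1)`), the optional `topic_prob` multiplier standing in for an LDA topic-word probability, the `top_k` cutoff for selecting feature words, and the hand-written `toy_vectors` standing in for trained Word2vec embeddings. It is not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    Assumed variant: tf = count / doc length, idf = log(N / df) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * (math.log(n_docs / df[w]) + 1.0)
            for w, c in counts.items()
        })
    return weights


def document_vector(doc_weights, word_vectors, topic_prob=None, top_k=3):
    """Weighted average of the vectors of the top_k highest-weighted words.

    topic_prob is a hypothetical {word: probability} map (e.g. taken from
    an LDA model); words missing from it keep their plain TF-IDF weight.
    """
    topic_prob = topic_prob or {}
    scored = {
        w: wt * topic_prob.get(w, 1.0)
        for w, wt in doc_weights.items()
        if w in word_vectors
    }
    # Keep only the most representative words as features.
    features = sorted(scored, key=scored.get, reverse=True)[:top_k]
    dim = len(next(iter(word_vectors.values())))
    total = sum(scored[w] for w in features)
    if total == 0:
        return [0.0] * dim
    return [
        sum(scored[w] * word_vectors[w][i] for w in features) / total
        for i in range(dim)
    ]


# Toy, already-tokenized messages (illustrative data only).
docs = [
    ["free", "prize", "call", "now"],
    ["meeting", "call", "tomorrow"],
    ["free", "prize", "win", "prize"],
]
# Hand-written 2-d vectors standing in for trained Word2vec embeddings.
toy_vectors = {
    "free": [0.9, 0.1], "prize": [0.8, 0.2], "win": [0.7, 0.1],
    "call": [0.5, 0.5], "now": [0.4, 0.3],
    "meeting": [0.1, 0.9], "tomorrow": [0.2, 0.8],
}

weights = tfidf_weights(docs)
doc_vecs = [document_vector(w, toy_vectors) for w in weights]
```

The resulting `doc_vecs` could then be passed to any classifier. In practice the toy vectors would be replaced by embeddings trained on the corpus, and `topic_prob` by values from a fitted topic model.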