
An Algorithm Based on Words Clustering LDA for Product Aspects Extraction (cited by: 12)
Abstract Product reviews often describe product aspects with medium- and low-frequency words that are near-synonyms or context-dependent, and identifying these words effectively is a difficult problem in aspect extraction. Because topic models lack prior knowledge, they have difficulty finding and extracting such medium- and low-frequency aspect words directly. This paper proposes a word clustering measure for product review corpora that combines the semantic similarity and the contextual relevance of words. On this basis, it builds a Latent Dirichlet Allocation (LDA) model for product aspect extraction that uses word clustering as prior knowledge, called WC-LDA. WC-LDA works in two steps. First, words are clustered according to the pairwise distance computed from their semantic similarity and contextual relevance. Second, the word clusters are introduced into LDA as a weighting factor for aspect extraction, which increases the probability that words in the same cluster are assigned to the same topic. Experimental results show that the proposed word clustering algorithm and the WC-LDA model are effective.
Source Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2015, No. 7, pp. 1458-1463 (6 pages)
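Funding Supported by the National Natural Science Foundation of China (61173146, 61363010, 61363039); the National Social Science Foundation of China (12CTQ042); the Jiangxi Provincial Higher Education Science and Technology Landing Program (Industry-University-Research Cooperation) (KJLD12022); the Major Project of the Jiangxi Provincial Natural Science Foundation (20152ACB20003); and the Jiangxi Provincial Graduate Innovation Special Fund (YC2013-B047).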
Keywords word clustering; contextual relevance; Latent Dirichlet Allocation (LDA) model; aspect extraction
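
The two steps named in the abstract (clustering words by a distance that mixes semantic similarity with contextual relevance, then using the clusters as a weighting factor during topic assignment) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the linear mixing weight `lam`, the single-linkage threshold clustering, the boost strength `gamma`, and all function names. It should not be read as the authors' WC-LDA implementation.

```python
import numpy as np


def combined_distance(sem_sim, ctx_rel, lam=0.5):
    """Blend two similarity matrices with values in [0, 1] into one distance matrix.

    `lam` is a hypothetical mixing weight, not a parameter from the paper.
    """
    sim = lam * sem_sim + (1.0 - lam) * ctx_rel
    return 1.0 - sim


def cluster_words(dist, threshold):
    """Single-linkage clustering: link every pair of words closer than `threshold`."""
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def topic_weights(word, labels, topic_word_counts, topic_totals,
                  doc_topic_counts, alpha=0.1, beta=0.01, gamma=1.0):
    """Normalised topic probabilities for one word in one document.

    Starts from the usual collapsed-Gibbs LDA terms and multiplies in a boost
    that grows with the share of a topic's tokens owned by the word's cluster
    mates. The boost form and `gamma` are assumptions for illustration.
    """
    vocab_size = topic_word_counts.shape[1]
    mates = np.array(
        [w for w, c in enumerate(labels) if c == labels[word] and w != word],
        dtype=int,
    )
    base = (topic_word_counts[:, word] + beta) / (topic_totals + vocab_size * beta)
    mate_mass = topic_word_counts[:, mates].sum(axis=1) / np.maximum(topic_totals, 1)
    boost = 1.0 + gamma * mate_mass
    p = (doc_topic_counts + alpha) * base * boost
    return p / p.sum()


# Toy vocabulary of four words: 0 and 1 are related, 2 and 3 are related.
sem_sim = np.array([[1.0, 0.9, 0.1, 0.1],
                    [0.9, 1.0, 0.1, 0.1],
                    [0.1, 0.1, 1.0, 0.8],
                    [0.1, 0.1, 0.8, 1.0]])
ctx_rel = sem_sim.copy()
labels = cluster_words(combined_distance(sem_sim, ctx_rel), threshold=0.5)

# Word 0 has never been assigned to a topic; its cluster mate (word 1) sits in topic 0.
topic_word_counts = np.array([[0, 5, 0, 0],
                              [0, 0, 0, 5]])
topic_totals = topic_word_counts.sum(axis=1)
doc_topic_counts = np.array([1, 1])
p = topic_weights(0, labels, topic_word_counts, topic_totals, doc_topic_counts)
```

In this toy setup, word 0 has no counts of its own, so a plain LDA conditional would treat both topics alike. The cluster boost tilts the assignment toward topic 0, where its cluster mate already appears, which is the effect the abstract describes for medium- and low-frequency words.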