Abstract
Product reviews often describe product aspects with medium- and low-frequency words that are near-synonyms or context-dependent, and identifying these words effectively is a difficult problem in aspect extraction. Lacking prior knowledge, topic models struggle to find and extract such medium- and low-frequency aspect words. This paper proposes a word clustering measure for product review corpora that combines semantic similarity with contextual relevance, and on that basis builds WC-LDA, a Latent Dirichlet Allocation (LDA) model for aspect extraction that uses word clusters as prior knowledge. First, words are clustered according to the pairwise distance computed from semantic similarity and contextual relevance. Second, word clustering is introduced into LDA aspect extraction as a weighting factor, which increases the probability that words in the same cluster belong to the same topic. Experimental results show that the proposed word clustering algorithm and the WC-LDA model perform well.
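The two steps described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the linear combination and its weight `alpha`, the single-link merge rule and its `threshold`, the multiplicative form of the weighting factor and its strength `lam`, and all function names. It is not the paper's actual WC-LDA definition.

```python
from itertools import combinations


def combined_distance(sem_sim, ctx_rel, alpha=0.5):
    """Turn semantic similarity and contextual relevance (both in [0, 1])
    into one distance. The linear mix and `alpha` are assumptions."""
    return 1.0 - (alpha * sem_sim + (1.0 - alpha) * ctx_rel)


def cluster_words(words, sem_sim, ctx_rel, alpha=0.5, threshold=0.5):
    """Single-link clustering (an assumed rule): merge any two words whose
    combined distance is below `threshold`.

    `sem_sim` and `ctx_rel` map frozenset({w1, w2}) to a score in [0, 1];
    missing pairs are treated as 0.
    """
    parent = {w: w for w in words}

    def find(w):
        # Union-find root lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        key = frozenset((a, b))
        d = combined_distance(sem_sim.get(key, 0.0), ctx_rel.get(key, 0.0), alpha)
        if d < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())


def cluster_weight(word, topic, topic_counts, cluster, lam=1.0):
    """Hypothetical weighting factor (>= 1) for assigning `word` to `topic`.

    It grows with the share of the word's cluster-mates currently assigned
    to that topic. `topic_counts[w][k]` is the number of tokens of word `w`
    assigned to topic `k`.
    """
    mates = [w for w in cluster if w != word]
    total = sum(sum(topic_counts.get(w, {}).values()) for w in mates)
    if total == 0:
        return 1.0  # no evidence from cluster-mates: leave the probability unchanged
    in_topic = sum(topic_counts.get(w, {}).get(topic, 0) for w in mates)
    return 1.0 + lam * in_topic / total
```

In a sampler, a factor like `cluster_weight` would multiply the unnormalized topic probability of a word before normalization, so that a word such as "screen" is pulled toward the topic where its cluster-mate "display" is already concentrated.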
Source
《小型微型计算机系统》 (Journal of Chinese Computer Systems)
CSCD
Peking University Core Journals (北大核心)
2015, Issue 7, pp. 1458-1463 (6 pages)
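Funding
National Natural Science Foundation of China (61173146, 61363010, 61363039)
National Social Science Foundation of China (12CTQ042)
Jiangxi Provincial Higher Education Science and Technology Landing Program, industry-university-research cooperation project (KJLD12022)
Natural Science Foundation of Jiangxi Province, Major Project (20152ACB20003)
Jiangxi Provincial Graduate Innovation Special Fund (YC2013-B047)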
Keywords
word clustering
contextual relevance
Latent Dirichlet Allocation (LDA) model
aspect extraction