Abstract
To classify and extract product attributes from reviews, so that reviews can be displayed by product attribute and consumers can make purchasing decisions more efficiently, this paper proposes a product attribute extraction method based on seed-constrained LDA (latent Dirichlet allocation). The method first uses the term frequency-inverse document frequency (TF-IDF) algorithm to extract keywords automatically as the attribute seed set. It then reorganizes the documents in two passes so that each twice-reorganized document describes only one product attribute, which addresses the co-occurrence of multiple attribute categories in long texts and the sparsity of short texts and improves the document reorganization rate. Next, two kinds of seed constraints, must-link and cannot-link, are used to define a probability scaling value (expansion or contraction); constraining the Gibbs sampling process in this way influences the topic assignment of LDA and makes the training results more reasonable. Finally, the topics generated by seed-constrained LDA are mapped to the prior attribute categories. Qualitative analysis (attribute categories, attribute words) and quantitative analysis (accuracy, entropy, purity) show that the proposed method achieves higher accuracy and purity and lower entropy than the existing methods it is compared with, indicating a better clustering effect.
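The abstract reports purity and entropy as clustering quality measures but does not give their formulas. The sketch below uses the common textbook definitions (size-weighted averages over clusters, entropy in bits), which is an assumption and may differ from the paper's exact formulation. The function name `cluster_purity_and_entropy` and its arguments are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_purity_and_entropy(cluster_ids, gold_labels):
    """Return (purity, entropy) of a clustering against gold labels.

    Assumed definitions (not taken from the paper):
      purity  = sum over clusters of (majority-label count) / n
      entropy = sum over clusters of (cluster size / n) * H(labels in cluster)
    where H is Shannon entropy with log base 2.
    """
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have equal length")
    n = len(gold_labels)
    if n == 0:
        raise ValueError("need at least one item")

    # Group the gold labels by the cluster each item was assigned to.
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, gold_labels):
        members[cid].append(label)

    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        # Items that agree with the cluster's majority label count as correct.
        purity += max(counts.values()) / n
        # Shannon entropy of the label distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        entropy += (size / n) * h
    return purity, entropy
```

Under these assumed definitions, a clustering in which every cluster contains a single attribute category has purity 1 and entropy 0, which is why higher purity and lower entropy are read as a better clustering effect. As a toy case, cluster assignments `[0, 0, 0, 1, 1, 1]` with gold labels `["screen", "screen", "battery", "battery", "battery", "battery"]` give a purity of 5/6 and an entropy of about 0.459 bits.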
Authors
CHEN Kejia (陈可嘉)
ZHENG Jingjing (郑晶晶)
School of Economics and Management, Fuzhou University, Fuzhou 350116, Fujian, China
Source
Journal of South China University of Technology (Natural Science Edition) (《华南理工大学学报(自然科学版)》)
EI
CAS
CSCD
PKU Core (北大核心)
2022, No. 6, pp. 37-48, 70 (13 pages in total)
Funding
National Natural Science Foundation of China (71701019)
National Social Science Foundation of China (19BTQ072)
Keywords
attribute extraction
term frequency-inverse document frequency
LDA model
seed constraint
reorganization
attribute category mapping