Abstract
Word frequencies in review text follow a power-law distribution, which biases the word distributions of an LDA (Latent Dirichlet Allocation) model toward high-frequency words; as a result, topics become highly similar to one another and their expressive power declines. A power-function weighted LDA model is therefore proposed to improve the expressive power of low-frequency words. The iForest algorithm is then used to select a set of distinctive and valuable reviews. Experimental results show that the selected review subset achieves relatively high feature coverage and a relatively high average information content.
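The effect of power-function weighting on a skewed vocabulary can be illustrated with a minimal sketch. The abstract does not state the weighting function, so the form `f ** (-alpha)`, the value `alpha=0.5`, the helper names `power_weights` and `mass_share`, and the toy reviews are all assumptions made for illustration, not the paper's method. The sketch covers only the weighting idea; it does not fit an LDA model or run iForest.

```python
from collections import Counter


def power_weights(docs, alpha=0.5):
    """Return corpus term counts and a weight f ** (-alpha) for each term.

    ASSUMPTION: this functional form and alpha are illustrative only;
    the abstract does not specify the paper's weighting function.
    """
    freq = Counter(tok for doc in docs for tok in doc.split())
    return freq, {term: f ** (-alpha) for term, f in freq.items()}


def mass_share(freq, weights, term):
    """Share of the total weighted count mass held by one term."""
    total = sum(freq[t] * weights[t] for t in freq)
    return freq[term] * weights[term] / total


# Hypothetical toy reviews: "good" dominates, "overheats" is rare.
docs = [
    "good phone good screen",
    "good battery good price",
    "good phone overheats",
]
freq, weights = power_weights(docs)
uniform = {term: 1.0 for term in freq}  # baseline: no reweighting

for term in ("good", "overheats"):
    before = mass_share(freq, uniform, term)
    after = mass_share(freq, weights, term)
    print(f"{term}: share before={before:.3f}, after={after:.3f}")
```

Under these assumptions the frequent term loses share of the total count mass and the rare term gains it, which is the direction of change the weighted model is meant to produce in the topic-word distributions.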
Source
《软件导刊》
2018, No. 1, pp. 38-40 (3 pages)
Software Guide