Abstract
To address the large amount of latent knowledge in Web resources under social tagging systems, and the fact that these resources are treated as independent of one another, a word-weighted latent Dirichlet allocation (LDA) topic recognition method based on a linear regression model is proposed. A linear regression model is used to establish a fitting function between arbitrary text resources, and the fitting function yields a weight for each resource, which addresses the assumption that resources are independent and identically distributed. The data points of the fitting function are then weighted so that every word in the corpus receives a weight, giving a weight coefficient for each dictionary word. A word-weighted LDA model is built on these word weights, and the latent topics of Web resources are mined through Gibbs sampling. Experimental results show that, compared with traditional topic models, the new word-weighted LDA algorithm achieves better topic recognition on Web resources.
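As a rough illustration of the word-weighting idea, the following minimal sketch runs collapsed Gibbs sampling for LDA in which each token contributes a per-word weight to the count matrices instead of 1. It is not the authors' implementation: the abstract does not give the regression-based weighting formula, so the `weights` vector is assumed to be supplied by the caller, and the function name `weighted_lda_gibbs`, its parameters, and the default hyperparameters are all hypothetical.

```python
import numpy as np

def weighted_lda_gibbs(docs, weights, n_topics, vocab_size,
                       alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with per-word weights.

    docs:    list of documents, each a list of integer word ids
    weights: one positive weight per vocabulary word (assumed given;
             the paper derives these from a linear regression fit)
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_dk = np.zeros((len(docs), n_topics))   # weighted document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # weighted topic-word counts
    n_k = np.zeros(n_topics)                 # weighted total per topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += weights[w]
            n_kw[k, w] += weights[w]
            n_k[k] += weights[w]

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wt = z[d][i], weights[w]
                # Remove the current token's weighted contribution.
                n_dk[d, k] -= wt
                n_kw[k, w] -= wt
                n_k[k] -= wt
                # Conditional distribution over topics for this token.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + beta * vocab_size)
                p /= p.sum()
                k = rng.choice(n_topics, p=p)
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d, k] += wt
                n_kw[k, w] += wt
                n_k[k] += wt

    phi = (n_kw + beta) / (n_k[:, None] + beta * vocab_size)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + alpha * n_topics)
    return theta, phi
```

Setting every weight to 1 reduces this sketch to the standard unweighted sampler, which makes it convenient for comparing the two on the same corpus.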
Authors
TAI Yue (邰悦)
GE Bin (葛斌)
Anhui University of Science and Technology, Huainan 232001, China
Source
Journal of Jinling Institute of Technology (《金陵科技学院学报》)
2021, No. 2, pp. 39-45 (7 pages)
Funding
National Natural Science Foundation of China (51874003, 61703005)
Natural Science Foundation of Anhui Province (1808085MG221)