Abstract
To cluster ordinary text corpora and digital book corpora effectively, clustering algorithms based on the standard LDA (Latent Dirichlet Allocation) model and on the TC-LDA model are proposed, respectively. TC-LDA extends LDA by jointly modeling topics from both the table of contents and the body text of each book. Unlike traditional clustering methods, the topic-model-based algorithms group documents that share a topic into the same cluster. Experimental results show that clustering based on topic analysis outperforms traditional clustering algorithms.
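The grouping step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how documents are assigned to clusters, so the rule used here (each document joins the cluster of its highest-probability topic) is an assumption, and the function name `cluster_by_dominant_topic` and the sample proportions are hypothetical. The document-topic proportions are taken as already inferred by an LDA-style model.

```python
from collections import defaultdict


def cluster_by_dominant_topic(doc_topic):
    """Group document indices by their highest-probability topic.

    doc_topic: list of per-document topic proportions (one row per document).
    Assumption: a document belongs to the cluster of its dominant topic;
    ties go to the lowest topic index.
    """
    clusters = defaultdict(list)
    for doc_id, theta in enumerate(doc_topic):
        dominant = max(range(len(theta)), key=theta.__getitem__)
        clusters[dominant].append(doc_id)
    return dict(clusters)


# Hypothetical document-topic proportions (each row sums to 1),
# standing in for the output of a trained topic model.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.75, 0.05, 0.20],
    [0.05, 0.15, 0.80],
]

clusters = cluster_by_dominant_topic(doc_topic)
print(clusters)  # {0: [0, 2], 1: [1], 2: [3]}
```

Documents 0 and 2 land in the same cluster because topic 0 dominates both, which mirrors the idea of grouping documents that share a topic.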
Source
《电子学报》
EI
CAS
CSCD
PKU Core Journals (北大核心)
2012, No. 11, pp. 2346-2350 (5 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China, Young Scientists Fund (Nos. 61103171 and 61103099)
Zhejiang Provincial Public Welfare Technology Application Research Program (No. 2011C31048)