Abstract
Existing document relationship analysis models have difficulty judging document correlation at the topic level. To address this, a topic-based probabilistic document correlation model (TPDC) is proposed. TPDC learns the topic structure of documents with the Latent Dirichlet Allocation (LDA) model, derives the posterior probability of a document from the computed topic posterior probabilities and topic similarities, and then builds the document correlation model on that document posterior probability. Experimental results show that TPDC outperforms the vector space model in both retrieval precision and degree of document compression, which makes it better suited to document retrieval tasks in practical applications.
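The abstract does not give the TPDC formulas, so the following is only a minimal sketch of the general idea of scoring two documents through their topics rather than their raw words. Everything in it is an assumption made for illustration: the names `topic_similarity` and `document_correlation`, the use of cosine similarity between topic-word distributions, the bilinear score, and the toy numbers standing in for LDA output. It is not the paper's derivation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def topic_similarity(phi):
    """Pairwise similarity between topics (assumed here: cosine of topic-word rows)."""
    return [[cosine(p, q) for q in phi] for p in phi]

def document_correlation(theta_a, theta_b, sim):
    """Assumed score: sum over topic pairs of P(i|a) * sim(i, j) * P(j|b)."""
    return sum(
        theta_a[i] * sim[i][j] * theta_b[j]
        for i in range(len(theta_a))
        for j in range(len(theta_b))
    )

# Toy stand-ins for LDA output (hypothetical values, not from the paper).
# phi[k][w]: probability of word w under topic k; topics 0 and 1 are close.
phi = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
# theta[d][k]: posterior probability of topic k in document d.
theta = {
    "d1": [0.80, 0.10, 0.10],
    "d2": [0.10, 0.80, 0.10],
    "d3": [0.05, 0.05, 0.90],
}

sim = topic_similarity(phi)
score_12 = document_correlation(theta["d1"], theta["d2"], sim)
score_13 = document_correlation(theta["d1"], theta["d3"], sim)
print(f"d1 vs d2: {score_12:.3f}")
print(f"d1 vs d3: {score_13:.3f}")
```

In this toy setup d1 and d2 are dominated by different topics, yet they score higher against each other than d1 does against d3, because their dominant topics are similar. That is the kind of topic-level relatedness a purely word-overlap comparison can miss.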
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal
2008, No. 10, pp. 178-180, 218 (4 pages)
Funding
Natural Science Foundation of Guangdong Province (07006474)
Guangdong Provincial Key Science and Technology Project (2007B010200044)
Keywords
Topic, Topic similarity, Document correlation, Text mining