Abstract
To address the incomplete annotations produced by existing semantic annotation techniques for Web documents, the Latent Dirichlet Allocation (LDA) model was applied to the semantic annotation of Web documents. Because Web documents exhibit clear domain characteristics, domain information was embedded into the traditional LDA model to obtain a new model, domain-enabled LDA, which improved the completeness of the annotations and avoided forcing a topic assignment onto words. An association was also established between the latent topics of a document and the concepts of the ontology for the document's domain, so that the semantics expressed by the ontology concepts gave the latent topics an accurate interpretation, made the document's semantics explicit, and supported document retrieval. Because the LDA model assigns a latent topic to every word in a document, the notion of multi-granularity semantic annotation was proposed. Experiments on the 20news-group and WebKB datasets demonstrated the effectiveness of the domain-enabled LDA model and indicated that multi-granularity annotation helps in handling different types of queries in information retrieval.
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2010, Issue A12, pp. 3401-3406 (6 pages)
Funding
Supported by the National Natural Science Foundation of China (Grant No. 60873196)
Keywords
statistical topic model
ontology
semantic annotation
concept
information retrieval