Journal Article

A Semantic Relatedness Measure Combining Lexical Features and LDA (cited by: 5)

Combining lexical features and LDA for semantic relatedness measure
Abstract: Computing semantic relatedness between text documents is a key problem in many domains, for example Natural Language Processing (NLP) and Semantic Information Retrieval (SIR). ESA (Explicit Semantic Analysis), which builds on lexical features over Wikipedia as a knowledge base, has received wide attention and application in these areas mainly because of its simplicity and effectiveness. However, ESA's relatedness computation is high-dimensional and inefficient owing to the large number of redundant concepts involved, and it ignores the contribution of the documents' underlying topics. This paper introduces the LDA (Latent Dirichlet Allocation) topic model: the highly related concepts returned by ESA are converted into the model's topic-probability vectors, reducing dimensionality and improving efficiency, and the JSD (Jensen-Shannon Divergence) distance replaces cosine distance so that the relatedness measure is more reasonable and effective. Evaluation on benchmark data sets of different granularity shows that the proposed method combining lexical features and the topic model improves the Pearson correlation coefficient by more than 3% over ESA and more than 9% over LDA.
Authors: XIAO Bao; LI Pu; JIANG Yuncheng (School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535011, China; School of Computer Science, South China Normal University, Guangzhou 510631, China; Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450000, China)
Source: Computer Engineering and Applications (计算机工程与应用), CSCD, Peking University Core Journal, 2017, No. 12, pp. 152-157, 165 (7 pages)
Funding: National Natural Science Foundation of China (No. 61272066); Guangzhou Science and Technology Plan Project (No. 2014J4100031); Basic Ability Promotion Project for Young and Middle-aged Teachers in Guangxi Universities (No. KY2016LX431)
Keywords: topic model; lexical features; Explicit Semantic Analysis (ESA); Latent Dirichlet Allocation (LDA); semantic relatedness measure
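The abstract describes the method at a high level: the Wikipedia concepts that ESA ranks as most related to a document are converted into an LDA topic-probability vector, and the Jensen-Shannon divergence (JSD) between two such vectors replaces cosine distance as the relatedness measure. The following Python sketch illustrates only that comparison step under stated assumptions; it is not the authors' implementation. It assumes the gensim library is available, and the toy corpus and the jsd and topic_vector helpers are invented for illustration.

# Minimal sketch (not the authors' released code): relatedness between two
# documents via LDA topic vectors compared with Jensen-Shannon divergence.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (log base 2, so bounded in [0, 1])."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topic_vector(model, bow, num_topics):
    """Dense topic-probability vector for a bag-of-words document."""
    vec = np.zeros(num_topics)
    for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# Toy corpus standing in for the ESA-filtered Wikipedia concept texts.
texts = [["semantic", "relatedness", "wikipedia"],
         ["topic", "model", "dirichlet"],
         ["semantic", "topic", "distribution"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

v1 = topic_vector(lda, dictionary.doc2bow(["semantic", "wikipedia"]), 2)
v2 = topic_vector(lda, dictionary.doc2bow(["topic", "dirichlet"]), 2)
# Lower divergence means higher relatedness; 1 - JSD serves as a score.
print("relatedness:", 1.0 - jsd(v1, v2))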
