
A cross-lingual document similarity calculation method based on bilingual LDA (cited by: 7)
Abstract: Building on the idea of bilingual topic models, we analyze the similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. We first train a bilingual LDA model on a bilingual parallel corpus, then use the trained model to predict the topic distributions of a new corpus, so that the bilingual documents of the new corpus are mapped into the same topic vector space. The similarity of those bilingual documents is computed by applying cosine similarity to their topic distributions. Feature topic weights are computed with a topic frequency-inverse document frequency method that we improve by taking into account the dispersion of topic distributions both between categories and within categories. Experimental results show that the improved weight calculation considerably raises the recall of the bilingual-LDA-based similarity algorithm, and that the algorithm is not restricted to particular categories and is reliable.
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2017, No. 5, pp. 978-983 (6 pages)
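Funding: National Natural Science Foundation of China (61363044, 61462054); General Program of the Yunnan Provincial Department of Science and Technology (2015FB135); Scientific Research Fund of the Yunnan Provincial Department of Education (2014Z021); Kunming University of Science and Technology Provincial Talent Training Project (KKSY201403028)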
Keywords: bilingual LDA; cross-lingual document similarity calculation; cosine similarity; topic frequency-inverse document frequency
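
The similarity step described in the abstract can be illustrated with a minimal sketch: once a Chinese document and an English document are both represented as topic distributions over the same K topics, their similarity is the cosine of the two vectors, optionally after scaling each topic by a weight. The function name `weighted_cosine`, the sample distributions, and the `weights` argument are assumptions made for illustration. The paper's improved topic frequency-inverse document frequency formula is not given in this record, so the weights here are plain placeholders and not the authors' method.

```python
import math


def weighted_cosine(theta_a, theta_b, weights=None):
    """Cosine similarity of two topic distributions over the same K topics.

    `weights` is a hypothetical per-topic weight vector (placeholder for any
    topic-weighting scheme); when omitted, every topic has weight 1.
    """
    if len(theta_a) != len(theta_b):
        raise ValueError("both documents must use the same topic space")
    if weights is None:
        weights = [1.0] * len(theta_a)
    if len(weights) != len(theta_a):
        raise ValueError("need exactly one weight per topic")

    # Scale each topic probability by its weight.
    a = [w * x for w, x in zip(weights, theta_a)]
    b = [w * y for w, y in zip(weights, theta_b)]

    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Return 0.0 when either weighted vector is all zeros.
    return dot / norm if norm > 0 else 0.0


# Made-up topic distributions (K = 3) for illustration only.
zh_doc = [0.70, 0.20, 0.10]    # Chinese document
en_doc = [0.65, 0.25, 0.10]    # English document on a similar topic mix
en_other = [0.05, 0.15, 0.80]  # English document on a different topic mix

print(weighted_cosine(zh_doc, en_doc))
print(weighted_cosine(zh_doc, en_other))
```

In this sketch the pair with similar topic mixes scores higher than the mismatched pair. Passing a `weights` vector emphasizes or suppresses individual topics before the comparison.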
