Abstract
Cross-lingual topic discovery is an important research direction in cross-lingual text mining, with high practical value for analyzing and organizing text data across languages. This paper improves the LDA topic model with Bagging and cross-lingual word embeddings and proposes BCL-LDA (Bagging, Cross-lingual word embedding with LDA), a cross-lingual topic discovery method for mining key information from multilingual text. The method first combines the Bagging ensemble learning idea with the LDA topic model to generate a set of mixed-language subtopics. It then uses cross-lingual word embeddings and the K-means algorithm to cluster the mixed subtopics into groups. Finally, it applies the TF-IDF algorithm to filter and rank the topic words. Topic discovery experiments on Chinese-German and Chinese-French text show that the method performs well in both topic coherence and topic diversity, extracting bilingual topics that are more semantically related, more coherent, and more diverse.
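The clustering and ranking stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact procedure, so the two-dimensional embedding table `EMB`, the hand-written `SUBTOPICS` (standing in for the output of LDA runs on bootstrap samples), the farthest-point K-means initialization, and the smoothed IDF formula are all assumptions made for illustration. Each subtopic is represented by the mean of its word vectors, the subtopics are grouped with a small K-means routine, and the words of each group are ranked by a TF-IDF score computed across groups.

```python
import math
from collections import Counter

# Hypothetical toy cross-lingual embeddings: Chinese and German words in one shared space.
EMB = {
    "经济": (1.0, 0.1), "Wirtschaft": (0.9, 0.2), "市场": (0.8, 0.0), "Markt": (0.85, 0.1),
    "足球": (0.0, 1.0), "Fußball": (0.1, 0.9), "比赛": (0.2, 0.8), "Spiel": (0.1, 0.85),
}

# Hypothetical mixed-language subtopics; in the described method these would
# come from LDA models trained on bootstrap samples of the corpus (Bagging).
SUBTOPICS = [
    ["经济", "Markt", "市场"],
    ["Wirtschaft", "市场", "经济"],
    ["足球", "Spiel", "比赛"],
    ["Fußball", "比赛", "足球"],
]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels

def rank_words(subtopics, labels, k, top_n=3):
    """Treat each cluster's merged word list as one document and rank words by TF-IDF."""
    docs = [[w for t, lab in zip(subtopics, labels) if lab == j for w in t] for j in range(k)]
    df = Counter(w for doc in docs for w in set(doc))
    ranked = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
        score = {w: (c / len(doc)) * (math.log((1 + k) / (1 + df[w])) + 1)
                 for w, c in tf.items()}
        ranked.append(sorted(score, key=lambda w: (-score[w], w))[:top_n])
    return ranked

# Represent each subtopic by the mean embedding of its words, then cluster and rank.
points = [centroid([EMB[w] for w in t]) for t in SUBTOPICS]
labels = kmeans(points, k=2)
topics = rank_words(SUBTOPICS, labels, k=2)
for j, words in enumerate(topics):
    print(j, words)
```

In this toy setting the two economy subtopics and the two football subtopics fall into separate clusters, and the words repeated across the bagged subtopics rise to the top of each merged topic. A real pipeline would substitute trained cross-lingual embeddings and actual LDA output for the hard-coded tables.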
Authors
LI Shuai; YU Juan; WU Shaocheng (School of Economics and Management, Fuzhou University, Fuzhou 350108, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2024, Issue S01, pp. 182-189 (8 pages)
Funding
National Natural Science Foundation of China (71771054, 72171090).
Keywords
Topic discovery
Cross-lingual
LDA
Topic clustering
German
French