Research on Embedded Text Topic Model Based on BERT (cited by: 6)
Abstract Topic models mine semantically rich topic words from massive text collections and play an important role in text analysis tasks. The traditional LDA topic model represents text with a bag-of-words model, so it cannot capture the semantic and sequential relationships between words, and it ignores stop words and low-frequency words. The embedded topic model (ETM) addresses these problems by representing words with Word2Vec vectors, but it typically assigns a polysemous word the same vector in every context and therefore cannot reflect context-dependent differences in meaning. To address this, a BERT-based embedded topic model, BERT-ETM, is designed for topic mining, and its effectiveness is verified on widely used Chinese and international datasets and on a text corpus from the Software Engineering domain. The experimental results show that the method overcomes the shortcomings of traditional topic models: topic coherence and diversity improve markedly, and the model performs well on polysemous words. In particular, WoBERT-ETM, which incorporates Chinese word segmentation, mines high-quality, fine-grained topic words and is very effective on large-scale text.
Authors WANG Yuhan; LIN Min; LI Yanling; ZHAO Jiapeng (College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China)
Source Computer Engineering and Applications (CSCD; Peking University Core Journal), 2023, No. 1, pp. 169-179 (11 pages)
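Funding National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607); Inner Mongolia Autonomous Region Science and Technology Plan Project (JH20180175); Information Security 242 Project (2019A114).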
Keywords topic model; BERT model; word embedding; word vector visualization
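
The abstract refers to embedding-based topic-word distributions and to topic diversity without defining them. The sketch below is illustrative only and is not the authors' implementation. It assumes the standard ETM formulation, in which a topic's word distribution is the softmax of inner products between word embeddings and a topic embedding, and it assumes topic diversity means the share of unique words among the top-k words of all topics, since the abstract does not state its metric definitions. The vocabulary, vectors, and function names are made up for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topic_word_distribution(word_embeddings, topic_embedding):
    # ETM-style distribution: softmax of word-topic inner products.
    scores = [sum(w * t for w, t in zip(vec, topic_embedding))
              for vec in word_embeddings]
    return softmax(scores)

def top_words(vocab, probs, k):
    # The k most probable words of one topic, most probable first.
    ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

def topic_diversity(topics_top_words):
    # Assumed definition: unique top words divided by total top words.
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)

# Hypothetical toy data: one static vector per word, two topic embeddings.
vocab = ["bank", "loan", "river", "water"]
word_embeddings = [[1.0, 1.0], [2.0, 0.0], [0.1, 2.0], [0.0, 1.5]]
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]

topics = [
    top_words(vocab, topic_word_distribution(word_embeddings, t), 3)
    for t in topic_embeddings
]
print(topics)
print(topic_diversity(topics))
```

In this toy setup "bank" has a single static vector that sits between the two topics, so it appears among the top words of both and lowers the diversity score. This is the kind of context-independent representation that the abstract says Word2Vec-based ETM produces for polysemous words.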