BERT-Single: semi-supervised method for topic detection and tracking (cited by: 1)
Abstract: Unsupervised clustering methods struggle to learn deep semantic features and task-related features when applied to topic detection and tracking, and methods such as K-means clustering and Latent Dirichlet Allocation (LDA) cannot be used for incremental clustering. To address these problems, BERT-Single, a semi-supervised algorithm based on a pre-trained language model, is proposed. First, the pre-trained language model BERT is trained on a small amount of labeled data so that it learns task-specific prior knowledge and generates text vectors that suit topic detection and tracking and carry deep semantic features. Then, an improved Single-Pass clustering algorithm generalizes the labeled-sample information learned by the pre-trained language model to unlabeled data, improving performance on topic detection and tracking. In experiments on a constructed dataset, BERT-Single improved on the comparison models by at least 3 percentage points in precision, at least 1 percentage point in recall, and at least 3 percentage points in F1. The BERT-Single model is effective for topic detection and tracking and adapts well to incremental clustering tasks.
Authors: HOU Boyuan (侯博元); CUI Zhe (崔喆); XIE Xinran (谢欣冉) (Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2022, Issue S01, pp. 21-27 (7 pages)
Funding: Sichuan Science and Technology Program (2020YFG0009); Sichuan Major Science and Technology Special Project (2019ZDZX0005).
Keywords: clustering; semi-supervised learning; Topic Detection and Tracking (TDT); pre-trained language model; news topic
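
The abstract's clustering step builds on Single-Pass clustering, which is what makes the method incremental: each document is compared with the existing topic clusters as it arrives and either joins the closest one or starts a new topic. The sketch below shows only the plain, textbook Single-Pass procedure, not the improved variant the paper proposes, whose details are not given in this record. The function names `cosine` and `single_pass`, the running-mean centroid update, the threshold of 0.8, and the toy 2-dimensional vectors (standing in for BERT text vectors) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold=0.8):
    """Plain Single-Pass clustering (illustrative, not the paper's variant).

    Each vector is processed once, in arrival order. It joins the most
    similar existing cluster if that similarity reaches `threshold`;
    otherwise it opens a new cluster.
    """
    centroids = []  # running mean vector of each cluster
    sizes = []      # number of members in each cluster
    labels = []     # cluster index assigned to each input vector
    for vec in vectors:
        best, best_sim = -1, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No cluster is similar enough: start a new topic.
            centroids.append(list(vec))
            sizes.append(1)
            best = len(centroids) - 1
        else:
            # Fold the new vector into the cluster's running mean.
            n = sizes[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], vec)
            ]
            sizes[best] = n + 1
        labels.append(best)
    return labels


# Toy vectors standing in for document embeddings (assumed data).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass(docs, threshold=0.8))
```

Because clusters are created on demand, the number of topics does not have to be fixed in advance, which is the property K-means and LDA lack in the incremental setting the abstract describes.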
