
Similarity Calculation of News Texts and Comments Combined with Contrastive Learning
Abstract: The similarity calculation between news texts and news comments aims to filter out the comments relevant to a news text. Since most comments evaluate a news text in the form of short texts, the similarity calculation between news texts and comments is essentially a similarity calculation between a long text and a short text. Traditional long-text processing methods are prone to problems such as loss of textual information and ambiguous article themes, which reduce the accuracy of similarity calculation. Addressing the length gap between news texts and comments, and taking the characteristics of comments into account, this paper proposes a similarity calculation method for news texts and comments that incorporates contrastive learning. The method compresses the news text and reduces its redundant information by extracting keywords; the keyword sequence is concatenated with the news title to form the representation of the news text; positive and negative text samples are then constructed with a contrastive learning approach on the BERT pre-trained model; finally, the pre-trained model is fine-tuned with cross-entropy and relative-entropy (KL divergence) loss functions to compute text similarity. Experiments show that the proposed method improves accuracy by 3.6% over recent long-text processing methods, and it also achieves good results on public datasets for Chinese text similarity calculation.
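The compression step described in the abstract can be illustrated with a minimal sketch: rank the body tokens of a news text by TF-IDF, keep the top few as keywords, and prepend the title to form a compact representation that a comment is then scored against. The paper encodes texts with a contrastively fine-tuned BERT model; here a plain TF-IDF extractor and bag-of-words cosine similarity stand in for that encoder purely to show the data flow, and all function names are hypothetical.

```python
# Illustrative sketch of the preprocessing pipeline: compress a long news
# text into "title + keywords", then score a comment against it.
# TF-IDF and cosine similarity are stand-ins for the BERT encoder.
import math
from collections import Counter

def extract_keywords(doc_tokens, corpus, top_k=5):
    """Rank the tokens of one document by TF-IDF against a small corpus."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0     # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_k]]

def compress_news(title_tokens, body_tokens, corpus, top_k=5):
    """Represent the news text as its title plus extracted keywords."""
    return title_tokens + extract_keywords(body_tokens, corpus, top_k)

def cosine_sim(a_tokens, b_tokens):
    """Bag-of-words cosine similarity (stand-in for the fine-tuned encoder)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A comment sharing vocabulary with the compressed representation scores higher than an unrelated one, which is the filtering behavior the paper targets; the actual method replaces the token-overlap score with embedding similarity from the contrastively trained BERT model.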
Authors: WANG Hong-bin (王红斌); ZHANG Zhuo (张卓); LAI Hua (赖华) (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Computer Technology Application, Kunming University of Science and Technology, Kunming 650500, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD, Peking University Core Journal, 2023, Issue 12, pp. 2671-2677 (7 pages)
Funding: Supported by the National Natural Science Foundation of China (61966020) and the General Program of the Yunnan Fundamental Research Plan (CB22052C143A).
Keywords: text similarity; keyword extraction; BERT; contrastive learning
