
TP-TM: A Two-Phase Text Matching Model for Long-Form Texts
Abstract  Traditional text matching methods cannot learn deep semantic matching features between texts, and deep short-text matching models struggle to capture fine-grained matching signals in long texts. To address these problems, a two-phase text matching model for long-form texts, TP-TM (Two-Phase Text Matching), is proposed. First, a sentence-level filter removes noisy sentences and extracts key sentences; the key sentences are then fed into a word-level filter, which uses a BERT (Bidirectional Encoder Representations from Transformers) model incorporating an improved pruning strategy to mine deep interaction features between texts and to perform word-level noise filtering and fine-grained matching on the key sentences. Finally, the relationship between a text pair is predicted by concatenating features from different positions of BERT. Experiments on the public Chinese long-text news datasets CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story) show that, compared with baseline models, TP-TM improves accuracy by 0.99 and 1.55 percentage points and F1 score by 0.98 and 1.46 percentage points on CNSE and CNSS respectively, effectively improving the accuracy of long-form text matching.
Authors  WANG Jiarui; PENG Cheng; FAN Min (Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)
Source  Journal of Computer Applications (CSCD, Peking University Core), 2023, No. S01, pp. 33-38 (6 pages)
Funding  Sichuan Science and Technology Program (2022ZHCG0007)
Keywords  text matching; long-form text; BERT (Bidirectional Encoder Representations from Transformers); filter; feature deletion
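The two-phase idea summarized in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the overlap-based sentence scorer, all function names, and the `top_k` parameter are assumptions for demonstration, and the paper's actual sentence-level filter, BERT encoder, pruning strategy, and classifier are not reproduced here.

```python
# Illustrative sketch of the two-phase pipeline described in the abstract.
# The scoring heuristic and all names below are assumptions, not the
# paper's implementation.
import re

def split_sentences(text):
    # Naive splitter on Chinese and Western sentence-ending punctuation.
    return [s for s in re.split(r"[。.!?！？]", text) if s.strip()]

def tokens(text):
    # Lowercased word tokens with punctuation stripped.
    return set(re.findall(r"\w+", text.lower()))

def sentence_filter(doc, other_doc, top_k=2):
    """Phase 1 (assumed heuristic): rank the sentences of `doc` by token
    overlap with `other_doc` and keep the top_k as key sentences,
    discarding low-overlap "noise" sentences."""
    sents = split_sentences(doc)
    scored = sorted(sents, key=lambda s: -len(tokens(s) & tokens(other_doc)))
    return scored[:top_k]

def build_match_input(doc_a, doc_b, top_k=2):
    """Phase 2 stand-in: pair the two key-sentence sets. In TP-TM this
    pair would be fed to the BERT-based word-level filter (with the
    improved pruning strategy), and the text-pair relationship predicted
    by concatenating features from different BERT positions."""
    key_a = " ".join(sentence_filter(doc_a, doc_b, top_k))
    key_b = " ".join(sentence_filter(doc_b, doc_a, top_k))
    return key_a, key_b
```

The sketch replaces the learned sentence-level filter with a simple lexical-overlap ranking so that phase 1 is runnable in isolation; the actual model learns this filtering and the downstream matching jointly.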