Abstract
Two problems motivate this work: traditional text matching methods cannot learn deep semantic matching features between texts, and deep short-text matching models struggle to capture the fine-grained matching signals in long texts. To address them, a two-phase text matching model for long-form texts, TP-TM (Two-Phase Text Matching), is proposed. First, a sentence-level filter removes noisy sentences and extracts the key sentences. The key sentences are then fed into a word-level filter, which uses a BERT (Bidirectional Encoder Representations from Transformers) model incorporating an improved pruning strategy to mine deep interaction features between the texts, performing word-level noise filtering and fine-grained matching on the key sentences. Finally, the relationship between the text pair is predicted by concatenating features taken from different positions of BERT. Experiments on two public Chinese long-text news datasets, CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story), show that, compared with the baseline models, TP-TM improves accuracy by 0.99 and 1.55 percentage points and F1 by 0.98 and 1.46 percentage points on CNSE and CNSS, respectively, indicating that the model improves the accuracy of long-form text matching.
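The first phase described in the abstract (sentence-level filtering) can be illustrated with a minimal sketch. The abstract does not say how the sentence-level filter scores sentences, so the bag-of-words cosine similarity used here is only a stand-in, and every name (`split_sentences`, `tokenize`, `key_sentences`, `top_k`) is hypothetical rather than taken from the paper. The second phase (BERT with the improved pruning strategy) is not sketched.

```python
import re
from collections import Counter
from math import sqrt


def split_sentences(text):
    """Split text on common Chinese and Latin sentence terminators."""
    parts = re.findall(r"[^。!?!?.]+[。!?!?.]?", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercased Latin words/numbers, plus single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def key_sentences(doc, other, top_k=2):
    """Keep the top_k sentences of `doc` most similar to `other`.

    Assumption: similarity to the other document is the filter criterion;
    the paper's actual sentence-level filter may differ.
    """
    other_vec = Counter(tokenize(other))
    sents = split_sentences(doc)
    scored = [(cosine(Counter(tokenize(s)), other_vec), i) for i, s in enumerate(sents)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    keep = sorted(i for _, i in ranked)  # restore original sentence order
    return [sents[i] for i in keep]


doc_a = ("The city council approved the new subway line. "
         "Local bakeries reported record sales. "
         "Construction of the subway line starts in May.")
doc_b = "Subway line construction was approved by the council."
print(key_sentences(doc_a, doc_b, top_k=2))
```

In this toy input, the sentence about bakeries shares no tokens with the other document and is dropped as noise. In a pipeline like the one the abstract describes, the surviving sentences from both documents would then be passed to the word-level stage.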
Authors
WANG Jiarui (王佳睿)
PENG Cheng (彭程)
FAN Min (范敏)
Affiliations: Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journals (北大核心)
2023, Issue S01, pp. 33-38 (6 pages)
Funding
Sichuan Science and Technology Program (2022ZHCG0007)
Keywords
text matching
long-form text
BERT (Bidirectional Encoder Representations from Transformers)
filter
feature deletion