Abstract
The performance of a dependency parser trained on large-scale in-domain labeled data degrades sharply when it is tested on out-of-domain data. One of the reasons lies in the lack of high-quality labeled data in the target domain. Since manual annotation is time-consuming and labor-intensive, automatically generating target-domain labeled data is an effective remedy. As a typical multi-model collaborative training method, tri-training aims to ensure the quality of auto-labeled data by exploiting the predictions of multiple models. In this study, we systematically compare three typical tri-training methods for cross-domain dependency parsing. Our model achieves state-of-the-art results on the NLPCC-2019 shared-task datasets, outperforming previous models by a large margin. Finally, we conduct detailed analyses to gain insights into why parser performance degrades across domains and what role tri-training plays.
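To make the tri-training idea described above concrete, the following is a minimal Python sketch of its core data-selection step. It assumes three already-trained parsers exposing a simple parse(sentence) -> tree interface; the interface, the exact-agreement criterion, and all names here are illustrative assumptions, not the implementation used in the paper.

    # Minimal sketch of one tri-training selection round (illustrative only).
    from typing import Callable, List, Sequence, Tuple

    Sentence = Sequence[str]   # a tokenized sentence
    Tree = Tuple[int, ...]     # head index for each token (simplified tree)

    def tri_training_round(
        parsers: List[Callable[[Sentence], Tree]],
        unlabeled: List[Sentence],
    ) -> List[List[Tuple[Sentence, Tree]]]:
        """For each parser, collect auto-labeled sentences on which the other
        two parsers produce identical trees; these pseudo-labeled instances
        are later added to that parser's training data."""
        assert len(parsers) == 3
        new_data: List[List[Tuple[Sentence, Tree]]] = [[], [], []]
        for sent in unlabeled:
            trees = [p(sent) for p in parsers]
            for i in range(3):
                j, k = [x for x in range(3) if x != i]
                # The two "teacher" parsers must agree on the full tree before
                # the sentence is handed to the third ("student") parser.
                if trees[j] == trees[k]:
                    new_data[i].append((sent, trees[j]))
        return new_data

Variants of tri-training typically differ in how agreement between the two "teacher" parsers is defined (full-tree versus partial agreement) and in how the selected pseudo-labeled sentences are combined with the gold source-domain data when each parser is retrained.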
Authors
LI Shuaike (李帅克), LI Ying (李英), LI Zhenghua (李正华), ZHANG Min (张民)
School of Computer Science & Technology, Soochow University, Suzhou 215006, China
Source
Journal of Xiamen University (Natural Science) (《厦门大学学报(自然科学版)》)
Indexed in: CAS, CSCD, PKU Core Journals
2022, No. 4, pp. 638-645, I0001 (9 pages)
Funding
National Natural Science Foundation of China (61876116)
Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Keywords
tri-training
domain adaptation
dependency parsing