Abstract
To address the limited labeled data available for training deep learning models in Chinese text error correction, five data noise replacement schemes are proposed. Experiments show that two of them, sound-similarity replacement and shape-similarity replacement, effectively improve data quality in this field. Comparative experiments on these two schemes then lead to a more effective hybrid replacement scheme. The core idea is to use noise replacement to increase the size and diversity of existing datasets, and thereby improve the performance of Chinese spelling correction models.
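As a rough illustration of the hybrid replacement idea, the following minimal sketch injects sound-alike and shape-alike character errors into a clean sentence to produce a (noisy, clean) training pair. The confusion tables, the function name `add_noise`, and the parameters `p_replace` and `p_sound` are all assumptions made for this example; the abstract does not specify the confusion resources or the mixing ratio the authors used.

```python
import random

# Hypothetical toy confusion sets (not from the paper). A real setup would
# derive them from a pinyin table and a glyph-similarity resource.
SOUND_ALIKE = {"在": ["再"], "做": ["作"], "的": ["地", "得"]}
SHAPE_ALIKE = {"己": ["已"], "未": ["末"], "人": ["入"]}


def add_noise(sentence, p_replace=0.3, p_sound=0.5, rng=None):
    """Return a (noisy, clean) pair built by character-level substitution.

    p_replace: chance of corrupting a character that has any candidate.
    p_sound:   chance of using the sound-alike table when both tables
               have candidates (an assumed mixing knob for the hybrid).
    """
    rng = rng or random.Random()
    out = []
    for ch in sentence:
        sound = SOUND_ALIKE.get(ch, [])
        shape = SHAPE_ALIKE.get(ch, [])
        if (sound or shape) and rng.random() < p_replace:
            # Prefer the sound table with probability p_sound; fall back
            # to whichever table actually has a candidate.
            if sound and (not shape or rng.random() < p_sound):
                pool = sound
            else:
                pool = shape
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out), sentence


if __name__ == "__main__":
    noisy, clean = add_noise("我在做自己的事", p_replace=1.0, rng=random.Random(0))
    print(noisy, "->", clean)
```

Because each substitution is one character for one character, the noisy and clean sentences stay aligned position by position, which is the form that sequence-labeling style spelling correctors usually expect.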
Authors
李建义
白雪丽
王洪俊
王迦南
Li Jianyi; Bai Xueli; Wang Hongjun; Wang Jianan (School of Computer Science & Engineering, North China Institute of Aerospace Engineering, Langfang 065000, China; TRS Information Technology Co., Ltd., Beijing 100000, China)
Source
《北华航天工业学院学报》
CAS
2021, Issue 6, pp. 1-4, 44 (5 pages in total)
Journal of North China Institute of Aerospace Engineering
Funding
Natural Science Foundation of Hebei Province (F2019409056).
Keywords
Chinese spelling correction
deep learning
labeled data
noise substitution
data augmentation