Abstract
With the rapid growth of online text, classification is essential for extracting and processing it efficiently. Based on the BERT model, this paper proposes a subword-level Chinese text classification method. The method replaces the original masked language model with subword-level masking, which masks complete Chinese words and strengthens BERT's word-vector representation of Chinese text. A Chinese word position embedding is also added, compensating for BERT's lack of Chinese word position information. Experimental results show that a BERT model using the proposed method achieves the best classification performance among the compared models on multiple Chinese datasets.
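The two ideas summarized above, masking whole Chinese words rather than isolated characters and attaching a word-level position id to each character, can be sketched roughly as follows. The segmentation, the function name, and the 15% default masking rate are illustrative assumptions, not the paper's exact implementation.

```python
import random

MASK = "[MASK]"

def subword_level_mask(segmented_words, mask_prob=0.15, rng=None):
    """Whole-word masking over a pre-segmented Chinese sentence:
    when a word is selected, every character in it is masked
    together, unlike the original character-level MLM.  Also
    returns a word-position id per character, a stand-in for the
    Chinese word position signal the abstract describes."""
    rng = rng or random.Random(0)
    tokens, labels, word_pos = [], [], []
    for wi, word in enumerate(segmented_words):
        chars = list(word)
        if rng.random() < mask_prob:
            tokens.extend([MASK] * len(chars))   # mask the whole word
            labels.extend(chars)                 # targets: original characters
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))   # positions not predicted
        word_pos.extend([wi] * len(chars))       # same word index for each char
    return tokens, labels, word_pos

# Pre-segmented sentence: 自然语言 / 处理 / 很 / 有趣
words = ["自然语言", "处理", "很", "有趣"]
tokens, labels, word_pos = subword_level_mask(words, mask_prob=0.5)
```

In a full model, `word_pos` would index an additional embedding table whose output is summed with BERT's token, segment, and character-position embeddings; the details of that table are not specified in the abstract.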
Source
Computer Science and Application (《计算机科学与应用》)
2020, No. 6, pp. 1075-1086 (12 pages)