Abstract
Automatic typo detection is an important research task in natural language processing, with significant value in applications such as search engines and automatic question answering. Although traditional methods detect multi-character word errors in text with relatively high accuracy, they perform poorly on Chinese single-character word errors because of the particular nature of such errors. This paper proposes a Transformer-based method for detecting Chinese single-character word errors. First, a training corpus of Chinese single-character word errors is built from Chinese character confusion sets and web pages. Second, at test time, a sliding window is applied to each sentence to be checked: single-character word error detection is performed on the sentence fragment in each window, and the results from the different windows are combined. Experiments show that the method is practical. It achieves a precision of 83.6% and a recall of 65.7% on an automatically generated test set, and a precision of 82.8% and a recall of 61.4% on a real test set.
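The two steps named in the abstract, building erroneous training sentences from a confusion set and combining detections across sliding windows, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the abstract does not give the window size, the stride, or the rule for combining window results, so the sketch uses a simple coverage-ratio vote. The names `CONFUSION`, `inject_error`, `sliding_windows`, `detect`, and `toy_detector` are hypothetical, and `toy_detector` is only a stand-in for the paper's Transformer model.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy confusion set: each character maps to characters
# it is easily confused with. A real confusion set is far larger.
CONFUSION: Dict[str, List[str]] = {"在": ["再"], "的": ["地", "得"]}


def inject_error(sentence: str, confusion: Dict[str, List[str]],
                 rng: random.Random) -> Tuple[str, int]:
    """Replace one confusable character with a confusion-set entry.

    Returns the corrupted sentence and the error position (-1 if the
    sentence contains no confusable character).
    """
    candidates = [i for i, ch in enumerate(sentence) if ch in confusion]
    if not candidates:
        return sentence, -1
    pos = rng.choice(candidates)
    wrong = rng.choice(confusion[sentence[pos]])
    return sentence[:pos] + wrong + sentence[pos + 1:], pos


def sliding_windows(sentence: str, size: int, stride: int) -> List[Tuple[int, str]]:
    """Return (start, fragment) pairs that together cover the sentence."""
    if len(sentence) <= size:
        return [(0, sentence)]
    starts = list(range(0, len(sentence) - size + 1, stride))
    if starts[-1] != len(sentence) - size:
        starts.append(len(sentence) - size)  # make sure the tail is covered
    return [(s, sentence[s:s + size]) for s in starts]


def detect(sentence: str, detector: Callable[[str], Set[int]],
           size: int = 4, stride: int = 2, min_ratio: float = 0.5) -> List[int]:
    """Flag a position when at least min_ratio of the windows covering it
    report an error there (an assumed combination rule, not the paper's)."""
    flagged: Dict[int, int] = defaultdict(int)
    covered: Dict[int, int] = defaultdict(int)
    for start, fragment in sliding_windows(sentence, size, stride):
        for offset in range(len(fragment)):
            covered[start + offset] += 1
        for offset in detector(fragment):  # offsets are relative to the fragment
            flagged[start + offset] += 1
    return sorted(p for p in flagged if flagged[p] / covered[p] >= min_ratio)


def toy_detector(fragment: str) -> Set[int]:
    """Placeholder for a trained model: flags every occurrence of '再'."""
    return {i for i, ch in enumerate(fragment) if ch == "再"}
```

For example, `inject_error("我在家", CONFUSION, random.Random(0))` corrupts the only confusable character, and `detect("我今天再家里看书", toy_detector)` reports the position that both overlapping windows agree on. A position-wise vote is only one plausible way to combine overlapping windows; a weighted or confidence-based scheme would fit the same interface.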
Authors
曹阳
曹存根
王石
CAO Yang; CAO Cungen; WANG Shi (Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China)
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2021, Issue 1, pp. 135-142 (8 pages)
Funding
National Key Research and Development Program of China (2017YFC1700300, 2017YFB1002300).