Abstract
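As learner corpora grow in scale, the need to automate corpus preprocessing becomes increasingly urgent. Spell checking is an important preprocessing step and a prerequisite for accurate corpus retrieval and statistical analysis. Existing general-purpose automatic spell checkers are not well suited to learner corpus construction. Moreover, because annotated data on learner spelling errors is scarce, supervised deep learning models cannot be applied. To address these problems, this study applies word embeddings to automatic spell checking and, combining them with edit-distance computation and an N-Gram language model, designs and builds an automatic spell check system for constructing large-scale English learner corpora. Evaluation results show that the word-embedding-enhanced system outperforms existing open-source automatic spell checkers on all major metrics, and that its computational efficiency meets the needs of corpus preprocessing.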
In recent years, a growing body of learner English writing has become publicly available for building large-scale learner corpora, creating an urgent need for automatic text processing tools that assist and accelerate corpus construction. In corpus studies, spell checking is usually one of the key procedures for cleaning texts before any corpus search, retrieval or statistical analysis is run. However, the automatic spell check tools currently available to the public have mostly been developed for general purposes and are not suitable for learner corpus preprocessing; their performance on learner English texts is far from satisfactory. Learner written English features a relatively restricted vocabulary range and some misspelling behaviors typical of second language learners. Unfortunately, it is not feasible at present to improve such systems with supervised deep learning models, because adequate human-annotated spelling error data is lacking. To solve these problems, this study introduces a word embedding model into the design of a spell check system. Word embedding is a recent language modeling technique in Natural Language Processing. A word embedding model learns a meaning vector for each word in a corpus from its contextual distributional information. A misspelled word and its intended counterpart share many of their contexts of use and hence are similar in meaning and lie near each other in the vector space of the word embedding model. Word embedding models can therefore be used to rank the correction candidates for a misspelled word by computing their similarity to it. By combining the word embedding model with edit-distance computation and N-Gram language models, this study designs and constructs an automatic spell check system specifically for preprocessing large-scale learner English corpora. The system consists of two major components: a spelling error detection module and a spelling error correction module. The former detects non-word errors by matching words against an English dictionary, filters out Chinese pinyin (phonetic transcriptions of Chinese characters), and identifies irregular verb mistakes and run-on errors. The latter works in two steps: generating a candidate word list and sorting it. Candidate generation is based mainly on the traditional minimum edit-distance approach, but with a larger tolerance for edit distance. Candidate sorting is carried out through a series of scoring operations. A candidate word is first graded by its frequency in a learner English word list, then scored by its similarity in a word embedding model trained on a large collection of learner English texts, and finally checked against an n-gram model generated from a reference corpus to verify its contextual plausibility in normal usage. The top-ranked word in the sorted candidate list is selected as the correction. The performance of the system is evaluated against six open-source spell check tools on precision, recall, F1 score and speed. The evaluation results show that the word-embedding-enhanced automatic spell check system outperforms the other tools in precision, recall and F1 score, and is faster than four of the six tools.
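The correction step described above (generate candidates within an edit-distance tolerance, then score them by frequency, embedding similarity and n-gram context) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy lexicon, frequencies, vectors and bigrams, the tolerance of 3, and the equal-weight sum of the three scores are all assumptions made purely for illustration.

```python
import math

def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def generate_candidates(word, lexicon, max_distance=3):
    """Lexicon words within the tolerance (3 is an assumed 'larger tolerance')."""
    return sorted(w for w in lexicon if edit_distance(word, w) <= max_distance)

def rank_candidates(word, prev_word, candidates, freq, vectors, bigrams):
    """Score by relative frequency + embedding similarity + bigram attestation.

    The equal-weight sum is an assumption; the abstract does not state how
    the three scores are combined.
    """
    total = sum(freq.values()) or 1
    scored = []
    for cand in candidates:
        f_score = freq.get(cand, 0) / total
        if word in vectors and cand in vectors:
            sim = cosine(vectors[word], vectors[cand])
        else:
            sim = 0.0
        ctx = 1.0 if (prev_word, cand) in bigrams else 0.0
        scored.append((f_score + sim + ctx, cand))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for _, cand in scored]

def correct(word, prev_word, lexicon, freq, vectors, bigrams, max_distance=3):
    """Return the word itself if it is in the lexicon, else the top candidate."""
    if word in lexicon:
        return word
    candidates = generate_candidates(word, lexicon, max_distance)
    ranked = rank_candidates(word, prev_word, candidates, freq, vectors, bigrams)
    return ranked[0] if ranked else word

# Toy data, invented for illustration only.
LEXICON = {"because", "become", "beach", "believe", "late"}
FREQ = {"because": 50, "become": 30, "beach": 10, "believe": 10}
VECTORS = {
    "becuase": [0.9, 0.1],   # the misspelling has a vector because the
    "because": [1.0, 0.0],   # embedding is trained on raw learner text
    "become": [0.2, 0.8],
    "beach": [0.0, 1.0],
}
BIGRAMS = {("late", "because")}

best = correct("becuase", "late", LEXICON, FREQ, VECTORS, BIGRAMS)
print(best)
```

In this sketch the misspelled form needs its own vector, which is only possible when the embedding model is trained on uncorrected learner text, as the abstract describes. A real system would replace the linear scan over the lexicon with a faster candidate lookup.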
Authors
梁茂成
邓海龙
LIANG Mao-cheng; DENG Hai-long (School of Foreign Languages, Beihang University, Beijing 100083, China; National Research Center for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089, China / School of Foreign Languages, Gannan Normal University, Ganzhou, Jiangxi 341000, China)
Source
《外语电化教学》
CSSCI
PKU Core Journal
2020, No. 1, pp. 31-37, 5 (8 pages in total)
Technology Enhanced Foreign Language Education
Funding
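An interim research outcome of the National Social Science Fund of China project "Vectorization and Automatic Clustering of Corpus Concordances Based on Deep Learning Methods" (Project No. 19BYY082).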
Keywords
英语学习者语料库
自动拼写检查
词向量
Learner English Corpora
Automatic Spelling Correction
Word Embeddings