By examining twenty TOEFL mock essays written by high school students in Shenzhen, this paper further analyzes learner English in China with regard to its common grammatical "errors" and their possible underlying causes. Common grammatical errors include misuse of the past tense, word class, the existential structure, and the topic-comment sentence, most of which are due to transfer from Chinese.
Identifying and correcting grammatical errors in text written by non-native writers has received increasing attention in recent years. Although a number of annotated corpora have been established to facilitate data-driven grammatical error detection and correction approaches, they are still limited in quantity and coverage because human annotation is labor-intensive, time-consuming, and expensive. In this work, we propose to use unlabeled data to train neural-network-based grammatical error detection models. The basic idea is to cast error detection as a binary classification problem and derive positive and negative training examples from unlabeled data. We introduce an attention-based neural network to capture long-distance dependencies that influence the word being detected. Experiments show that the proposed approach significantly outperforms SVMs and convolutional networks with a fixed-size context window.
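The abstract above does not say how positive and negative examples are derived from unlabeled text, so the following is only a minimal sketch under a stated assumption: every word in an unlabeled sentence is treated as correct (label 0), and an erroneous counterpart (label 1) is created by swapping that word for a different member of a confusion set. The function name `make_detection_examples` and the `confusion_sets` mapping are hypothetical and are not taken from the paper.

```python
import random


def make_detection_examples(tokens, confusion_sets, seed=None):
    """Build (sentence, target_index, label) triples from one unlabeled sentence.

    Assumption (not from the paper): the original word at each position is a
    correct example (label 0); substituting a different word from a
    hypothetical confusion set yields an erroneous example (label 1).
    """
    rng = random.Random(seed)
    examples = []
    for i, tok in enumerate(tokens):
        # The unmodified sentence, with position i as the word being judged.
        examples.append((list(tokens), i, 0))
        # Candidate substitutes for this word, excluding the word itself.
        candidates = [w for w in confusion_sets.get(tok, []) if w != tok]
        if candidates:
            corrupted = list(tokens)
            corrupted[i] = rng.choice(candidates)
            examples.append((corrupted, i, 1))
    return examples


if __name__ == "__main__":
    sentence = ["he", "goes", "to", "school"]
    confusions = {"goes": ["go", "went"], "to": ["at"]}
    for sent, idx, label in make_detection_examples(sentence, confusions, seed=0):
        print(label, idx, " ".join(sent))
```

Each triple can then be fed to any binary classifier that looks at the target word in its sentence context; the attention-based network mentioned in the abstract is one such classifier and is not reproduced here.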
Due to the lack of parallel data for the grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be adequately trained to reach higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, by replacement, insertion, or deletion. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic data show that synthetic data with an error rate of 40% and the same ratio of error types improves model performance the most. Finally, we synthesize about 100 million synthetic examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
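The word-level corruption approach in the abstract above (corrupt each word with a fixed probability by replacement, insertion, or deletion) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function name `corrupt_tokens`, the `vocab` list used as the source of replacement and inserted words, and the `type_weights` parameter for the ratio of error types are all assumptions.

```python
import random


def corrupt_tokens(tokens, vocab, error_rate=0.4, type_weights=(1, 1, 1), seed=None):
    """Corrupt a clean token list to produce a synthetic erroneous sentence.

    Each token is corrupted independently with probability `error_rate`.
    `type_weights` gives the relative frequency of (replace, insert, delete);
    equal weights are an assumption made for this sketch.
    """
    rng = random.Random(seed)
    ops = ("replace", "insert", "delete")
    out = []
    for tok in tokens:
        if rng.random() >= error_rate:
            out.append(tok)  # keep the token unchanged
            continue
        op = rng.choices(ops, weights=type_weights, k=1)[0]
        if op == "replace":
            # May occasionally pick the same word; acceptable for a sketch.
            out.append(rng.choice(vocab))
        elif op == "insert":
            # Keep the token and add a random word after it.
            out.append(tok)
            out.append(rng.choice(vocab))
        # "delete": append nothing, so the token is dropped.
    return out


if __name__ == "__main__":
    clean = "she has lived in this city for ten years".split()
    vocab = ["a", "the", "in", "on", "have", "live", "year"]
    noisy = corrupt_tokens(clean, vocab, error_rate=0.4, seed=1)
    # (noisy, clean) forms one synthetic source/target training pair.
    print(" ".join(noisy), "->", " ".join(clean))
```

Pairing each corrupted sentence with its clean original gives the parallel data a sequence-to-sequence GEC model needs, and changing `error_rate` or `type_weights` is how the error rate and the ratio of error types would be controlled in this sketch.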
Funding: Supported by the Beijing Advanced Innovation Center for Language Resources (TYZ19005) and the Research Program of the State Language Commission (ZDI135-105, YB135-89).