Statistical language modeling techniques are investigated so as to construct a language model for Chinese text proofreading. After the defects of n-gram model are analyzed, a novel statistical language model for Chine...Statistical language modeling techniques are investigated so as to construct a language model for Chinese text proofreading. After the defects of n-gram model are analyzed, a novel statistical language model for Chinese text proofreading is proposed. This model takes full account of the information located before and after the target word wi, and the relationship between un-neighboring words w_i and w_j in linguistic environment(LE). First, the word association degree between w_i and w_j is defined by using the distance-weighted factor, w_j is l words apart from w_i in the LE, then Bayes formula is used to calculate the LE related degree of word w_i, and lastly, the LE related degree is taken as criterion to predict the reasonability of word w_i that appears in context. Comparing the proposed model with the traditional n-gram in a Chinese text automatic error detection system, the experiments results show that the error detection recall rate and precision rate of the system have been improved.展开更多
文摘Statistical language modeling techniques are investigated so as to construct a language model for Chinese text proofreading. After the defects of n-gram model are analyzed, a novel statistical language model for Chinese text proofreading is proposed. This model takes full account of the information located before and after the target word wi, and the relationship between un-neighboring words w_i and w_j in linguistic environment(LE). First, the word association degree between w_i and w_j is defined by using the distance-weighted factor, w_j is l words apart from w_i in the LE, then Bayes formula is used to calculate the LE related degree of word w_i, and lastly, the LE related degree is taken as criterion to predict the reasonability of word w_i that appears in context. Comparing the proposed model with the traditional n-gram in a Chinese text automatic error detection system, the experiments results show that the error detection recall rate and precision rate of the system have been improved.