Abstract
A real-word error in Chinese, like a real-word error in English (the target of context-sensitive spelling correction), occurs when one Chinese word is mistakenly written as another word that is also in the dictionary. This paper proposes a confusion-set-based method for detecting and correcting such errors. The method extracts local features around the target word: a left-adjacent bigram, a right-adjacent bigram, and three trigrams. The bigram and trigram probabilities are estimated using the confusion words in the target word's confusion set. A multi-feature fusion model is then proposed, and rules are applied to identify real-word errors in Chinese text. Checking results are divided into two types: marking an error and rewriting an error. Experiments used 18 confusion sets and a constructed test corpus of 20,000 lines. The results show that the method effectively detects real-word errors in Chinese text and offers correction suggestions, combining automatic error detection and automatic error correction in a single Chinese proofreading method.
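The pipeline outlined in the abstract can be illustrated with a small Python sketch. The abstract does not give the paper's fusion model, rules, or thresholds, so everything specific here is an assumption made for illustration: the function names, the add-one smoothing, normalizing each feature over the confusion set, the equal-weight average used as the "fusion", the two margins, and the toy corpus and confusion set.

```python
from collections import Counter

PAD_L, PAD_R = "<s>", "</s>"

def pad(words):
    # Two boundary symbols on each side so every position has full context.
    return [PAD_L, PAD_L] + list(words) + [PAD_R, PAD_R]

def train_counts(sentences):
    """Count bigrams and trigrams over word-segmented sentences."""
    bi, tri = Counter(), Counter()
    for words in sentences:
        p = pad(words)
        bi.update(zip(p, p[1:]))
        tri.update(zip(p, p[1:], p[2:]))
    return bi, tri

def local_features(words, i, candidate):
    """Left/right adjacent bigrams and three trigrams with `candidate` at position i."""
    p = pad(words)
    j = i + 2
    p[j] = candidate
    bigrams = [(p[j - 1], p[j]), (p[j], p[j + 1])]
    trigrams = [(p[j - 2], p[j - 1], p[j]),
                (p[j - 1], p[j], p[j + 1]),
                (p[j], p[j + 1], p[j + 2])]
    return bigrams, trigrams

def candidate_scores(words, i, confusion_set, counts):
    """Average of five smoothed probabilities, each normalized over the confusion set.

    Assumption: equal weights stand in for the paper's fusion model.
    """
    bi, tri = counts
    feats = {c: local_features(words, i, c) for c in confusion_set}
    scores = {}
    for c in confusion_set:
        total = 0.0
        for table, kind in ((bi, 0), (tri, 1)):
            for k, gram in enumerate(feats[c][kind]):
                denom = sum(table[feats[o][kind][k]] + 1 for o in confusion_set)
                total += (table[gram] + 1) / denom
        scores[c] = total / 5
    return scores

def check_word(words, i, confusion_set, counts,
               mark_margin=0.05, rewrite_margin=0.2):
    """Return ("ok" | "mark" | "rewrite", suggested word). Margins are hypothetical."""
    scores = candidate_scores(words, i, confusion_set, counts)
    best = max(scores, key=scores.get)
    gap = scores[best] - scores[words[i]]
    if best == words[i] or gap <= mark_margin:
        return "ok", words[i]
    if gap > rewrite_margin:
        return "rewrite", best
    return "mark", best

# Toy, hand-segmented corpus and one confusion set (illustrative only).
corpus = [
    ["公民", "享有", "权利", "和", "义务"],
    ["政府", "行使", "权力"],
    ["公民", "享有", "权利"],
    ["政府", "行使", "权力", "和", "职责"],
]
confusion = ["权利", "权力"]
counts = train_counts(corpus)

print(check_word(["政府", "行使", "权利"], 2, confusion, counts))
print(check_word(["公民", "享有", "权利"], 2, confusion, counts))
```

With this toy data the first call suggests rewriting 权利 as 权力, and the second leaves the word unchanged. Normalizing over the confusion set keeps the scores comparable across candidates, which is one plausible reading of "estimated using the confusion words"; the paper itself may define the probabilities differently.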
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2016, Issue 12, pp. 30-35 (6 pages)
Computer Science
Funding
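Supported by the National Natural Science Foundation of China (91224006, 61173063, 61035004, 61203284, 30973713) and the Key Program of the National Social Science Foundation of China (10AYY003).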