Abstract
A statistical analysis of a Chinese phrase corpus shows that phrases containing repeated characters or words have a relatively high typo rate. The repetition patterns are classified, and the error rate of each pattern is measured to identify the patterns with high typo rates. For example, the error probability is higher when the repeated character or word appears at the end of the phrase, or when the repetition is consecutive. Based on these per-pattern error rates, the paper recommends two optimization strategies for manually proofreading large-scale phrase corpora.
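The per-pattern statistic described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual classification scheme, so the category names (`consecutive`, `tail`, `other`, `none`), the function names, and the `(phrase, has_typo)` input format below are all assumptions made for illustration, and the sample phrases in the usage note are not data from the paper.

```python
from collections import defaultdict


def repeat_pattern(phrase: str) -> str:
    """Assign a phrase to one illustrative repetition category (assumed scheme)."""
    # Consecutive repetition: a character is immediately followed by itself.
    if any(a == b for a, b in zip(phrase, phrase[1:])):
        return "consecutive"
    # Tail repetition: the last character also occurs earlier in the phrase.
    if phrase and phrase[-1] in phrase[:-1]:
        return "tail"
    # Any other repeated character.
    if len(set(phrase)) < len(phrase):
        return "other"
    return "none"


def error_rate_by_pattern(samples):
    """Compute the typo rate per category from (phrase, has_typo) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for phrase, has_typo in samples:
        pattern = repeat_pattern(phrase)
        totals[pattern] += 1
        errors[pattern] += int(has_typo)
    return {pattern: errors[pattern] / totals[pattern] for pattern in totals}
```

With hypothetical labels such as `[("高高兴兴", True), ("不得不", True), ("不得不", False), ("语料", False)]`, the function returns one typo rate per category. A proofreading pass could then review the categories with the highest rates first, which is one plausible reading of the optimization strategies the abstract mentions.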
Source
Teaching and Science Technology (《教学与科技》)
2014, No. 4, pp. 38-42 (5 pages)
Keywords
Chinese phrase corpus
Proofreading strategies
Word repetition patterns
Typo rate