Abstract
Automatic proofreading of Chinese text is an important topic in natural language processing. Chinese text contains many kinds of errors, and homophone errors account for a large share of them. This paper proposes a decision-list method for handling them. First, 1000 pairs of commonly confused homophones are compiled by hand into confusion sets, and a bigram (2-gram) model and a context model are trained on a large corpus. During proofreading, the bigram and context features of each word and of all its homophones are extracted, and the trained models are used to compute a support score for each candidate; this constitutes the decision list for the homophone group. The candidate with the highest support is then selected, which achieves automatic detection and correction of homophone errors. Finally, to mitigate the data sparsity problem, the experiments are refined with synonym clustering, which improves recall among other measures. Experiments show that the method handles homophone errors effectively.
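The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scoring formula (a smoothed log-likelihood ratio, in the style of classic decision lists), the window size, the smoothing constant, the function names, the single confusion pair, and the toy corpus are all assumptions made for the example. The synonym-clustering refinement is not covered.

```python
import math
from collections import Counter, defaultdict

# Hypothetical confusion set; the paper's 1000 hand-built pairs are not reproduced here.
CONFUSION_SETS = [{"权利", "权力"}]

# Invented toy corpus of pre-segmented sentences, for illustration only.
TOY_CORPUS = [
    ["公民", "享有", "权利"],
    ["公民", "享有", "权利", "和", "义务"],
    ["滥用", "权力"],
    ["国家", "权力", "机关"],
]


def features(sent, i, window=2):
    """Bigram features (left/right neighbour) plus context words in a window."""
    feats = []
    if i > 0:
        feats.append(("L", sent[i - 1]))      # left bigram feature
    if i + 1 < len(sent):
        feats.append(("R", sent[i + 1]))      # right bigram feature
    lo, hi = max(0, i - window), min(len(sent), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            feats.append(("C", sent[j]))      # context feature
    return feats


def train(corpus, window=2):
    """Count how often each feature co-occurs with each confusion-set word."""
    targets = set().union(*CONFUSION_SETS)
    counts = defaultdict(Counter)
    for sent in corpus:
        for i, word in enumerate(sent):
            if word in targets:
                counts[word].update(features(sent, i, window))
    return counts


def support(counts, cand, rivals, feat, alpha=0.5):
    """Assumed support score: smoothed log ratio of candidate vs. rival counts."""
    own = counts[cand][feat] + alpha
    other = sum(counts[r][feat] for r in rivals) + alpha
    return math.log(own / other)


def best_support(counts, cand, rivals, feats):
    """Decision-list style: a candidate is judged by its strongest single feature."""
    return max((support(counts, cand, rivals, f) for f in feats), default=0.0)


def correct(counts, sent, i, window=2):
    """Return the confusion-set member with the highest support at position i."""
    word = sent[i]
    group = next((cs for cs in CONFUSION_SETS if word in cs), None)
    if group is None:
        return word                            # not a known homophone: leave as is
    feats = features(sent, i, window)
    best, best_score = word, best_support(counts, word, group - {word}, feats)
    for cand in sorted(group - {word}):
        score = best_support(counts, cand, group - {cand}, feats)
        if score > best_score:                 # replace only on strictly better evidence
            best, best_score = cand, score
    return best


model = train(TOY_CORPUS)
```

Calling `correct(model, ["公民", "享有", "权力"], 2)` on this toy model proposes 权利, because the neighbouring words were only seen with that spelling. Keeping the original word on ties is a design choice of this sketch that avoids changing text without evidence.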
Source
《电子设计工程》
2015, No. 9, pp. 39-41 (3 pages)
Electronic Design Engineering
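Funding
Open Fund of the Artificial Intelligence Key Laboratory of Sichuan Province (2012RYJ04)
Open Project of the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (IIP2013-1)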