
Automatic Proofreading of Wrong Characters for New Media Field (面向新媒体领域的错别字自动校对)

Cited by: 3
Abstract New media platforms publish a huge volume of original news every day, which makes manual checking of content for wrongly written characters impractical. This paper proposes a method that combines an n-gram model with rules. Hundreds of millions of news articles are collected as a training corpus. The word-segmented corpus is statistically analyzed to build a trigram (3-gram) model library, a confusion set of wrongly written characters is constructed from the surrounding context, and an optimization method is used to compute the support of each confusable word in the target context. Together these steps enable automatic detection and correction of wrongly written characters. Experiments show an error-detection recall of 78.9% and an accuracy rate (precision) of 85.1%. The authors conclude that the method has important practical significance and broad applicability.
Authors 龚永罡; 汪昕宇; 付俊英; 王蕴琪 (GONG Yong-gang; WANG Xin-Xu; FU Jun-xing; WANG Yun-qi)
Source 《信息技术与信息化》 (Information Technology and Informatization), 2018, No. 10, pp. 73-75 (3 pages)
Keywords N-gram model; confusion set; support degree; wrongly written characters
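
The pipeline summarized in the abstract (trigram statistics over a segmented corpus, a confusion set, and a support score for each confusable word) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `train`, `support`, and `correct`, the four-sentence toy corpus, the add-one smoothing, and the choice of "pick the candidate with the highest summed trigram log-probability" as a stand-in for the optimization method, which the abstract does not specify. It is not the authors' implementation.

```python
from collections import Counter
import math

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary markers

def pad(tokens):
    """Pad a segmented sentence so every word sits inside three trigrams."""
    return [BOS, BOS] + list(tokens) + [EOS, EOS]

def train(corpus):
    """Count trigrams and their two-word prefixes over segmented sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for sent in corpus:
        p = pad(sent)
        vocab.update(p)
        for i in range(len(p) - 2):
            tri[tuple(p[i:i + 3])] += 1
            bi[tuple(p[i:i + 2])] += 1
    return tri, bi, len(vocab)

def support(tokens, i, cand, model):
    """Sum of add-one-smoothed log-probabilities of the three trigrams
    that cover position i when `cand` is placed there (an assumed score)."""
    tri, bi, v = model
    p = pad(list(tokens[:i]) + [cand] + list(tokens[i + 1:]))
    j = i + 2  # index of the candidate inside the padded sentence
    score = 0.0
    for k in range(j - 2, j + 1):
        t = tuple(p[k:k + 3])
        score += math.log((tri[t] + 1) / (bi[t[:2]] + v))
    return score

def correct(tokens, confusion, model):
    """Replace each word by the best-supported member of its confusion set.
    The original word is listed first, so it is kept on ties."""
    out = list(tokens)
    for i, w in enumerate(tokens):
        cands = [w] + sorted(confusion.get(w, ()))
        out[i] = max(cands, key=lambda c: support(out, i, c, model))
    return out

# Toy segmented corpus (illustrative only, not the paper's data).
corpus = [
    ["我", "在", "北京", "工作"],
    ["他", "在", "北京", "上学"],
    ["请", "再", "说", "一遍"],
    ["明天", "再", "见"],
]
model = train(corpus)
confusion = {"再": {"在"}, "在": {"再"}}  # hypothetical confusion set

print(correct(["我", "再", "北京", "工作"], confusion, model))
# ['我', '在', '北京', '工作']
```

Positions where the output differs from the input are the detected errors. A sentence that already matches the corpus statistics, such as `["明天", "再", "见"]`, is returned unchanged.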
