Abstract
Migrating Han character internal codes to ISO/IEC 10646 is the inevitable path toward a unified character encoding for computers, but multiple Han character internal codes will continue to coexist for some time. Automatic recognition of the internal code in use is therefore the key to supporting that coexistence. This paper examines how to recognize Han character internal codes automatically in a multilingual environment where several internal codes coexist, and presents four recognition algorithms, based respectively on internal code distribution (code ranges), punctuation features, character frequency features, and semantic features. It then analyzes and evaluates these algorithms. In tests on the target sample documents, the highest recognition rate achieved is 99.9% or above.
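The abstract does not spell out the algorithms, so the following is only a minimal sketch of how the first two ideas (code-range checking, then punctuation features as a tie-breaker) might look. It assumes the two candidate encodings are GB2312 (EUC-CN form) and Big5, uses the commonly cited byte ranges for each, and checks only two punctuation marks; the function names `scan` and `guess_encoding` are hypothetical and are not taken from the paper.

```python
# Assumed byte ranges (not taken from the paper):
#   GB2312 (EUC-CN): lead 0xA1-0xF7, trail 0xA1-0xFE
#   Big5:            lead 0xA1-0xF9, trail 0x40-0x7E or 0xA1-0xFE
def gb_lead(b):  return 0xA1 <= b <= 0xF7
def gb_trail(b): return 0xA1 <= b <= 0xFE
def b5_lead(b):  return 0xA1 <= b <= 0xF9
def b5_trail(b): return 0x40 <= b <= 0x7E or 0xA1 <= b <= 0xFE

# Fullwidth comma and ideographic full stop in each encoding.
GB2312_PUNCT = {b"\xa3\xac", b"\xa1\xa3"}
BIG5_PUNCT = {b"\xa1\x41", b"\xa1\x43"}

def scan(data, lead_ok, trail_ok, punct):
    """Walk the bytes as one candidate encoding.

    Returns (invalid, hits): the number of high bytes that do not start a
    well-formed double-byte pair, and the number of well-formed pairs that
    are one of the given punctuation marks.
    """
    invalid = hits = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # ASCII byte, skip
            i += 1
        elif i + 1 < len(data) and lead_ok(b) and trail_ok(data[i + 1]):
            if data[i:i + 2] in punct:
                hits += 1
            i += 2
        else:                             # high byte outside the code ranges
            invalid += 1
            i += 1
    return invalid, hits

def guess_encoding(data):
    """Return 'gb2312', 'big5', or 'unknown' for a byte string."""
    gb_bad, gb_punct = scan(data, gb_lead, gb_trail, GB2312_PUNCT)
    b5_bad, b5_punct = scan(data, b5_lead, b5_trail, BIG5_PUNCT)
    if gb_bad != b5_bad:                  # step 1: code-range evidence
        return "gb2312" if gb_bad < b5_bad else "big5"
    if gb_punct != b5_punct:              # step 2: punctuation evidence
        return "gb2312" if gb_punct > b5_punct else "big5"
    return "unknown"

if __name__ == "__main__":
    print(guess_encoding("汉字内码，自动识别。".encode("gb2312")))
    print(guess_encoding("漢字內碼，自動識別。".encode("big5")))
```

Range checking alone cannot separate the two encodings when every trail byte happens to fall in 0xA1-0xFE, which is why the sketch falls back on punctuation. Text with no decisive evidence is reported as `unknown` rather than guessed; frequency or semantic features, as named in the abstract, would be the natural next step for such cases.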
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journals (北大核心)
2004, Issue 2, pp. 73-79 (7 pages)
Funding
Supported by the Natural Science Foundation of Jiangsu Province Higher Education Institutions (01KJB520001)
Keywords
computer application
Chinese information processing
multilingual environment
Han character internal code
recognition algorithm