Abstract
Chinese text normalization is the process of transforming non-Chinese-character strings into Chinese character strings so that their pronunciations can be determined. The difficulty is twofold: non-Chinese-character strings take many complex forms and are hard to categorize, and they are ambiguous and so require disambiguation. This paper introduces the concept of non-standard words (NSWs) to classify non-Chinese-character strings effectively, and proposes a three-layer normalization model comprising NSW detection, NSW disambiguation, and standard word generation. In the NSW disambiguation stage, a machine learning method is employed, avoiding the need to write complex rules. Experiments show that the approach performs well and generalizes to new domains, achieving an accuracy of 98.64% on an open test.
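The three-layer model described above can be sketched in miniature. The code below is not the authors' system: the function names, the regular expression, and the hand-written context cues standing in for a trained maximum entropy classifier are all illustrative assumptions, covering only a couple of NSW types (years and two-digit cardinals).

```python
import re

# Layer 1: NSW detection — find non-Chinese-character substrings.
# (Illustrative pattern: digit runs with common separators, or letter runs.)
NSW_PATTERN = re.compile(r"[0-9][0-9.:/\-]*|[A-Za-z]+")

def detect_nsws(text):
    """Return (span, string) pairs for non-standard words in text."""
    return [(m.span(), m.group()) for m in NSW_PATTERN.finditer(text)]

# Layer 2: NSW disambiguation — decide how each NSW should be read.
# A trained maximum entropy classifier would score context features here;
# this stub substitutes a few hand-picked surface cues.
def classify_nsw(nsw, left_context):
    if ":" in nsw:
        return "TIME"
    if re.fullmatch(r"\d{4}", nsw) and left_context.endswith(("于", "在")):
        return "YEAR"          # read digit by digit, e.g. 2008 -> 二〇〇八
    if nsw.isdigit():
        return "CARDINAL"      # read as a number, e.g. 45 -> 四十五
    return "LETTER_SEQ"        # spell out the letters

# Layer 3: standard word generation — expand the NSW into Chinese characters.
DIGITS = "〇一二三四五六七八九"

def expand(nsw, tag):
    if tag == "YEAR":
        return "".join(DIGITS[int(d)] for d in nsw)
    if tag == "CARDINAL" and len(nsw) == 2 and nsw[0] != "0":
        tens = "" if nsw[0] == "1" else DIGITS[int(nsw[0])]
        ones = "" if nsw[1] == "0" else DIGITS[int(nsw[1])]
        return tens + "十" + ones
    return nsw  # fall back to the original string

def normalize(text):
    """Run detection, disambiguation, and generation over the whole text."""
    out, last = [], 0
    for (start, end), nsw in detect_nsws(text):
        tag = classify_nsw(nsw, text[:start])
        out.append(text[last:start])
        out.append(expand(nsw, tag))
        last = end
    out.append(text[last:])
    return "".join(out)
```

For example, `normalize("发表于2008年")` yields "发表于二〇〇八年", while `normalize("第45页")` yields "第四十五页" — the same digit string is expanded differently depending on the predicted NSW class.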
Source
Journal of Chinese Information Processing (《中文信息学报》)
Indexed in: CSCD; Peking University Core Journals
2008, No. 5, pp. 45-50, 55 (7 pages)
Funding
Supported by the National 973 Program of China (2004CB318102)
Keywords
computer application
Chinese information processing
text normalization
text-to-speech
maximum entropy model