Abstract
For non-word error detection in Kazakh text, this paper surveys and summarizes existing error-detection methods. Supported by a Kazakh lexicon of a certain size, it exploits the characteristics of the language, using a Kazakh stem segmentation program and Kazakh syllable rules to find non-word errors in the text, and then applies the minimum edit distance algorithm to propose the most likely candidate words. For real-word error detection, it draws on context information and an N-gram language model: a trigram model of local co-occurrence probabilities in the text detects real-word errors, and an edit-distance-based pattern matching method then offers correction suggestions for them. Experimental results show that the system's error detection and correction efficiency is fairly good and that the experimental scheme is feasible.
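The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the distance threshold, the unsmoothed maximum-likelihood trigram estimate, and the Latin-script toy words are all assumptions made for illustration, and the stem segmentation and syllable-rule steps are not modeled here.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2, limit=3):
    """Rank lexicon entries by edit distance to a suspected non-word.

    max_distance and limit are hypothetical tuning parameters.
    """
    scored = [(edit_distance(word, w), w) for w in lexicon]
    scored = [(d, w) for d, w in scored if d <= max_distance]
    return [w for _, w in sorted(scored)[:limit]]


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    tri, bi = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi


def trigram_prob(tri, bi, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    context = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / context if context else 0.0


# Toy data, hypothetical and far smaller than a real lexicon or corpus.
lexicon = ["kitap", "kalam", "bala"]
corpus = [["men", "kitap", "oqydym"], ["men", "kitap", "aldym"]]
tri, bi = train_trigrams(corpus)

candidates = suggest("kitab", lexicon)
seen = trigram_prob(tri, bi, "men", "kitap", "oqydym")
unseen = trigram_prob(tri, bi, "men", "kitap", "bala")
```

In this sketch a word whose trigram probability falls below some chosen threshold would be treated as a suspected real-word error, and `suggest` would then supply replacement candidates. A practical system would also need smoothing for unseen trigrams, which the abstract does not describe.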
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journal (北大核心)
2012, Issue 4, pp. 9-12, 15 (5 pages in total)
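Funding
National Natural Science Foundation of China (60763005)
Ministry of Education and State Language Commission research project on the standardization and informatization of minority languages and scripts (MZ115-92)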
Keywords
Automatic text proofreading
Kazakh language
Minimum edit distance
N-gram
Pattern matching