基于字分类的中文分词的研究 (cited by: 10)

Chinese Word Segmentation Research Based on Classification of Words
Abstract: Chinese word segmentation is a prerequisite and foundation for natural language processing. The character-classification approach treats segmentation as the task of classifying characters. Each character is placed in the context of its two neighbors, the preceding and the following character, and mutual-information statistics assign it to one of four categories: a character that joins with the preceding character, a character that joins with the following character, a character that joins with both neighbors, and an independent character. A t-test algorithm is applied during segmentation, which resolves ambiguity to some extent. With the People's Daily corpus used for training and testing, the experiments show that the method handles ambiguity well and that segmentation accuracy reaches 90.3%, a marked improvement.
Source: Computer Technology and Development (《计算机技术与发展》), 2011, No. 7, pp. 29-31, 35 (4 pages)
Funding: Natural Science Foundation of Yunnan Province (2007F174M); Yunnan University Graduate Research Project (ynny200928)
Keywords: Chinese word segmentation; mutual information; t-test; categorization
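
The four-way character classification described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas or thresholds, so everything here is an assumption: the score is standard pointwise mutual information between adjacent characters, the function names (`train_counts`, `pmi`, `classify`) and the `threshold` parameter are hypothetical, and the t-test disambiguation step is not covered.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Count single characters and adjacent character pairs in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)              # one count per character
        bigrams.update(zip(s, s[1:]))   # one count per adjacent pair
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log2(p(ab) / (p(a) * p(b))).

    Returns -inf for a pair never seen in training (an assumed convention).
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0 or bigrams[(a, b)] == 0:
        return float("-inf")
    p_ab = bigrams[(a, b)] / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))


def classify(sentence, i, unigrams, bigrams, threshold=0.0):
    """Assign the character at position i to one of four categories.

    The threshold is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    joins_prev = i > 0 and pmi(sentence[i - 1], sentence[i], unigrams, bigrams) > threshold
    joins_next = (i < len(sentence) - 1
                  and pmi(sentence[i], sentence[i + 1], unigrams, bigrams) > threshold)
    if joins_prev and joins_next:
        return "both"         # joins with both neighbors
    if joins_prev:
        return "previous"     # joins with the preceding character
    if joins_next:
        return "following"    # joins with the following character
    return "independent"      # stands alone


# Toy corpus for illustration only; the paper trains on the People's Daily corpus.
corpus = ["中文", "中文", "分词", "分词", "我", "研究生", "研究生"]
unigrams, bigrams = train_counts(corpus)
labels = [classify("我中文", i, unigrams, bigrams) for i in range(3)]
```

A segmenter built on such labels would place a word boundary between two characters whenever the left one does not join forward and the right one does not join backward; how the paper resolves conflicting labels with the t-test is not stated in the abstract.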
