Abstract
To improve the accuracy of Chinese word segmentation on medical literature, this paper studies Chinese word segmentation algorithms in light of the characteristics of medical literature. It first introduces four families of methods: segmentation based on string matching, on understanding (semantics), on statistics, and on a combination of string matching and statistics, and compares the design ideas behind them. The algorithms are then implemented in C on the VC6.0 platform and evaluated in segmentation experiments on medical literature. The experimental results show that the forward maximum matching method, a string-matching approach, achieved relatively good performance.
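As an illustration of the forward maximum matching idea named in the abstract, the following is a minimal sketch rather than the authors' implementation (the paper reports using C on VC6.0; Python is used here only for brevity). The function name `forward_maximum_match`, the `max_word_len` parameter, and the toy dictionary are all hypothetical. The sketch scans left to right, tries the longest dictionary word at each position, and falls back to a single character when nothing matches.

```python
def forward_maximum_match(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    if max_word_len is None:
        # Assumption: the window size defaults to the longest dictionary entry.
        max_word_len = max((len(word) for word in dictionary), default=1)
    max_word_len = max(1, max_word_len)  # guard against a zero-width window

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, not the lexicon used in the paper.
toy_dictionary = {"医药", "文献", "医药文献", "中文", "分词", "中文分词", "算法"}
print(forward_maximum_match("医药文献中文分词算法", toy_dictionary))
```

Because the match is greedy, the result depends entirely on dictionary coverage: a domain lexicon containing medical terms would plausibly matter more than the matching loop itself.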
Source
《电脑知识与技术(过刊)》
2012, No. 6X, pp. 4138-4140, 4151 (4 pages in total)
Computer Knowledge and Technology
Funding
Guangdong Province Undergraduate Innovation Experiment Program (1057310031)
Keywords
medical documents
Chinese word segmentation
string matching