Abstract
To address the current state of medical information text classification, this paper designs and implements a KNN-based medical information text classification system. The system represents each text with the vector space model, classifies it with a fast KNN algorithm, and segments words with reverse maximum matching, which is intended to improve both classification accuracy and processing efficiency. The authors also construct a medical information dataset of 582 medical documents, randomly divided into a training set of 433 documents and a test set of 149 documents. Evaluated on this dataset, the system achieves an F1 score of 74.83%. The experiments indicate that the system classifies medical information text reasonably well.
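The pipeline named in the abstract (reverse maximum matching, vector space model, KNN) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses raw term-frequency vectors, cosine similarity, and an ordinary brute-force KNN rather than the paper's fast KNN variant, and the function names, toy dictionary, and class labels are all hypothetical.

```python
import math
from collections import Counter


def reverse_max_match(text, vocab, max_len=4):
    """Segment text right to left, taking the longest dictionary word
    that ends at the current position (single characters are a fallback)."""
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


def to_vector(tokens):
    """Vector space model with raw term frequencies (an assumption;
    the abstract does not state the weighting scheme)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Brute-force KNN: majority vote among the k most similar training vectors."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage with a hypothetical dictionary and hypothetical labels.
toy_vocab = {"研究", "研究生", "生命", "起源", "的"}
segmented = reverse_max_match("研究生命的起源", toy_vocab, max_len=3)

toy_training = [
    (to_vector(["aspirin", "dose", "tablet"]), "drug"),
    (to_vector(["ibuprofen", "dose", "tablet"]), "drug"),
    (to_vector(["fever", "cough", "symptom"]), "disease"),
    (to_vector(["flu", "fever", "symptom"]), "disease"),
]
predicted = knn_classify(to_vector(["tablet", "dose", "fever"]), toy_training, k=3)
```

Scanning from the right is what separates this from forward maximum matching: on the toy sentence, a left-to-right scan would greedily take 研究生 first, whereas the reverse scan yields 研究 / 生命 / 的 / 起源.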
Source
Computer Technology and Development (《计算机技术与发展》)
2009, Issue 4, pp. 206-209 (4 pages)
Funding
Guangdong Provincial Medical Research Fund project (B2008088)
Guangdong Pharmaceutical University (广东药学院) Research Fund project (2007YGY01)
Keywords
medical information
text categorization
vector space model
KNN algorithm