Abstract
Automatic extraction of proper names is a key technology in text mining, information retrieval, and machine translation. This paper studies a method for automatically extracting Chinese proper names that combines two classifiers, a support vector machine (SVM) and K nearest neighbors (KNN). Test samples are classified differently depending on where they fall in the feature space: in the classification phase, the algorithm computes the distance from the test sample to the SVM optimal hyperplane; if the distance is greater than a given threshold, the sample is classified by the SVM, and otherwise by KNN. In practical training corpora, negative examples often far outnumber positive ones, and the traditional KNN method is sensitive to such unbalanced training sets, so a normalized KNN classifier is proposed as a modification of classic KNN. Experimental results show that combining SVM with the modified KNN outperforms both SVM alone and the original SVM-KNN method in extracting Chinese proper names. The method can also be generalized to other classification problems with unbalanced class distributions.
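The decision rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the normalization formula, so the sketch assumes each class's vote among the k nearest neighbours is divided by that class's size in the training set. The function names, the linear hyperplane parameters `w` and `b`, the use of the full training set as the KNN reference set, and the default `k` and `threshold` values are all hypothetical.

```python
import math
from collections import Counter


def svm_distance(x, w, b):
    """Signed distance from x to the hyperplane w.x + b = 0 (w, b assumed already trained)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def normalized_knn(x, train_X, train_y, k):
    """KNN vote in which each class's neighbour count is divided by that
    class's training-set size (an assumed form of the normalization)."""
    class_sizes = Counter(train_y)
    # Take the k training samples closest to x in Euclidean distance.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: math.dist(x, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    # Pick the class with the highest size-normalized vote.
    return max(class_sizes, key=lambda c: votes[c] / class_sizes[c])


def svm_knn_predict(x, w, b, train_X, train_y, k=3, threshold=1.0):
    """Use the SVM far from the hyperplane, normalized KNN near it."""
    d = svm_distance(x, w, b)
    if abs(d) > threshold:
        return 1 if d > 0 else -1
    return normalized_knn(x, train_X, train_y, k)


# Toy unbalanced data: one positive sample, five negative samples.
train_X = [(0.3, 0.0), (-0.2, 0.0), (-0.3, 0.0), (-1.0, 0.0), (-2.0, 0.0), (-3.0, 0.0)]
train_y = [1, -1, -1, -1, -1, -1]
w, b = (1.0, 0.0), 0.0

far_label = svm_knn_predict((5.0, 0.0), w, b, train_X, train_y)   # decided by the SVM
near_label = svm_knn_predict((0.1, 0.0), w, b, train_X, train_y)  # decided by normalized KNN
```

In the toy data, the point near the hyperplane has two negative and one positive neighbour, so an unweighted majority vote would label it negative; dividing by class size lets the single positive training sample carry more weight, which is the kind of correction for imbalance the abstract describes.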
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2011, No. 6, pp. 610-617 (8 pages)
Journal of the China Society for Scientific and Technical Information
Funding
Supported by the National High Technology Research and Development Program of China (863 Program) (No. 2008AA04Z107)
Keywords
KNN
SVM
extraction of proper names
unbalanced data distribution