Abstract
In multi-label learning, each instance may be associated with several concept labels. Given a training set of multi-label instances, the task is to predict, as accurately as possible, the label set of each unseen instance; the size of that set is not known a priori. The k-nearest-neighbor (kNN) algorithm has recently been applied to this problem: each instance is transformed into a multidimensional vector, and the label vector of a test instance is determined from the label vectors of its k nearest neighbors. Traditional kNN selects neighbors by the Euclidean distance between vectors, whereas in natural language processing the similarity between texts is commonly expressed by the angle between their text vectors. This paper therefore presents θ-MLkNN, a multi-label lazy learning approach for text that is derived from the traditional kNN algorithm but selects the k nearest neighbors by the angle between vectors. Experiments on a real-world text data set show that θ-MLkNN achieves higher precision than the traditional MLkNN algorithm.
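The angle-based neighbor selection described in the abstract can be illustrated with a minimal sketch. The function names (`angle`, `k_nearest_by_angle`, `predict_labels`) and the toy data are hypothetical, and the final majority vote over neighbor labels is an assumption made only to keep the example short; the abstract does not state how θ-MLkNN combines the neighbors' label vectors, so this is not the paper's decision rule.

```python
import math


def angle(u, v):
    """Angle in radians between two vectors; a zero vector is treated as orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return math.pi / 2
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def k_nearest_by_angle(train_vectors, query, k):
    """Indices of the k training vectors with the smallest angle to the query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: angle(train_vectors[i], query))
    return order[:k]


def predict_labels(train_vectors, train_labels, query, k):
    """Assumed rule (not from the paper): keep a label if most neighbors carry it."""
    neighbors = k_nearest_by_angle(train_vectors, query, k)
    n_labels = len(train_labels[0])
    votes = [sum(train_labels[i][j] for i in neighbors) for j in range(n_labels)]
    return [1 if 2 * v > len(neighbors) else 0 for v in votes]


# Toy data: 2-dimensional "text vectors" with binary label vectors.
train_vectors = [[1, 0], [2, 0.1], [0, 1], [0.1, 3], [5, 5]]
train_labels = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
query = [10, 0.5]

print(k_nearest_by_angle(train_vectors, query, 3))
print(predict_labels(train_vectors, train_labels, query, 3))
```

Ranking by angle ignores vector length, so a short document and a long document with the same term proportions count as equally close, which is the usual motivation for angle-based similarity on text.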
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2008, No. 4, pp. 205-206, F0003 (3 pages)
Keywords
Machine learning
Multi-label learning
Text categorization