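Abstract: This paper makes full use of both the hyperlink relations and the text content of web pages and proposes an inductive semi-supervised learning algorithm for web page classification: the graph-based Co-training algorithm for web page classification, GCo-training for short. The effectiveness of the algorithm is proved theoretically. Within the Co-training framework, GCo-training iteratively learns a semi-supervised classifier based on a graph constructed from hyperlink information and a Bayes classifier based on text features. The graph-based semi-supervised classifier needs only a small amount of labeled data; by mining the abundant relational information among the data it can reach relatively high prediction accuracy and can therefore supply the Bayes classifier with a large amount of label information. Conversely, once the Bayes classifier has learned from this label information, it can provide useful information to the graph-based classifier. During the iterations the two classifiers help each other and steadily improve their performance, after which the Bayes classifier can be used to predict the classes of large amounts of unseen data. Experimental results on the Web→KB data set show that GCo-training outperforms both the Co-training algorithm that uses text features and anchor-text features and the EM-based Bayes algorithm.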
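The graph-based half of the scheme above can be illustrated with a minimal label-propagation sketch. This is not the GCo-training classifier itself, whose details the abstract does not give; the function name `propagate_labels`, the neighbour-averaging update rule, and the toy link graph are all assumptions made for illustration.

```python
def propagate_labels(adjacency, seed_labels, num_classes, iterations=20):
    """Spread a few known labels over a link graph by neighbour averaging.

    adjacency:   dict mapping each page to a list of linked pages
                 (links are treated as undirected; an assumption).
    seed_labels: dict mapping the few labeled pages to a class index.
    Returns a dict mapping every page to a predicted class index.
    """
    nodes = list(adjacency)
    # Labeled pages start one-hot; unlabeled pages start uniform.
    scores = {}
    for n in nodes:
        if n in seed_labels:
            scores[n] = [1.0 if c == seed_labels[n] else 0.0
                         for c in range(num_classes)]
        else:
            scores[n] = [1.0 / num_classes] * num_classes

    for _ in range(iterations):
        updated = {}
        for n in nodes:
            neighbours = adjacency[n]
            if n in seed_labels or not neighbours:
                # Labeled pages stay clamped; isolated pages keep their scores.
                updated[n] = scores[n]
                continue
            updated[n] = [
                sum(scores[m][c] for m in neighbours) / len(neighbours)
                for c in range(num_classes)
            ]
        scores = updated

    return {n: max(range(num_classes), key=lambda c: scores[n][c])
            for n in nodes}


# Toy graph: two chains of pages, one labeled page in each.
links = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "d": ["e"], "e": ["d", "f"], "f": ["e"],
}
predicted = propagate_labels(links, {"a": 0, "f": 1}, num_classes=2)
```

In a co-training setting, the confident predictions from such a graph classifier would be handed to the text-based classifier as extra training labels.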
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 20503015).
Abstract: Co-training is a semi-supervised learning method that employs two complementary learners to label unlabeled data for each other and to predict test samples jointly. Previous studies show that redundant information can help improve the ratio of prediction accuracy between semi-supervised learning methods and supervised learning methods. In practice, however, redundant information often hurts the performance of learning machines. This paper investigates which redundant features affect semi-supervised learning methods such as co-training, and how to remove redundant as well as irrelevant features. FESCOT (feature selection for co-training) is proposed to improve the generalization performance of co-training through feature selection. Experimental results on artificial and real-world data sets show that FESCOT helps to remove the irrelevant and redundant features that hurt the performance of co-training.
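The basic co-training loop described here, two learners on different feature views labeling data for each other, can be sketched as follows. This is a minimal illustration and not FESCOT: it performs no feature selection, and the nearest-centroid learner, the margin-based confidence, and the names `centroid_fit`, `centroid_predict` and `co_train` are assumptions.

```python
def centroid_fit(samples, labels):
    """Return {label: mean vector} for a nearest-centroid learner."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def centroid_predict(model, x):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and the second-nearest class centroid."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, y)
        for y, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view1, view2, labels, rounds=5):
    """labels[i] is a class, or None if sample i is unlabeled.
    In each round, the learner on each view labels the one unlabeled
    sample it is most confident about; the other learner then trains
    on the enlarged labeled pool."""
    labels = list(labels)
    for _ in range(rounds):
        for view in (view1, view2):
            labeled = [i for i, y in enumerate(labels) if y is not None]
            unlabeled = [i for i, y in enumerate(labels) if y is None]
            if not unlabeled:
                return labels
            model = centroid_fit([view[i] for i in labeled],
                                 [labels[i] for i in labeled])
            best = max(unlabeled,
                       key=lambda i: centroid_predict(model, view[i])[1])
            labels[best] = centroid_predict(model, view[best])[0]
    return labels


# Two one-dimensional views of six samples; only two are labeled.
view_a = [[0.0], [0.2], [0.1], [1.0], [0.9], [0.8]]
view_b = [[5.0], [5.1], [4.9], [1.0], [1.2], [0.9]]
completed = co_train(view_a, view_b, [0, None, None, 1, None, None])
```

A feature-selection step such as the one the paper proposes would prune the columns of each view before the learners are fitted.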
Funding: Supported in part by the National Natural Science Foundation of China (61933015) and in part by the Central University Basic Research Fund of China under Grant K20200002 (for NGICS Platform, Zhejiang University).
Abstract: Because fault samples are scarce and data fluctuate widely in the blast furnace (BF) ironmaking process, several transfer learning-based fault diagnosis methods have been proposed. The vast majority of these methods perform distribution adaptation by reducing the distance between data distributions and then apply a classifier to generate pseudo-labels for self-training. However, since the training data are dominated by labeled source-domain data, such classifiers tend to be weak in the target domain. In addition, the features generated after domain adaptation are likely to lie at the decision boundary, resulting in a loss of classification performance. Hence, we propose a novel method called minimax entropy-based co-training (MMEC) that adversarially optimizes a transferable fault diagnosis model for the BF. The structure of MMEC includes a dual-view feature extractor followed by two classifiers that compute the cosine similarity between a feature and the representative vector of each class. Knowledge transfer is achieved by alternately increasing and decreasing the entropy of unlabeled target samples with the classifier and the feature extractor, respectively. Transfer BF fault diagnosis experiments show that our method improves accuracy by about 5% over state-of-the-art methods.
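The two quantities this abstract relies on, a cosine-similarity classifier over class representative vectors and the entropy of its output, can be computed as in the sketch below. It shows only the forward computation, not the adversarial minimax training. The temperature value, the function names, and the toy vectors are assumptions, since the abstract does not specify them.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_probabilities(feature, prototypes, temperature=0.05):
    """Softmax over temperature-scaled cosine similarities between a
    feature and one representative vector per class.
    The temperature of 0.05 is an assumed value."""
    logits = [cosine_similarity(feature, p) / temperature for p in prototypes]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


prototypes = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical class vectors
ambiguous = class_probabilities([1.0, 1.0], prototypes)   # on the boundary
confident = class_probabilities([3.0, 0.0], prototypes)   # aligned with class 0
```

A feature on the decision boundary gives maximal entropy, while a feature aligned with one representative vector gives entropy near zero. In the scheme described, the classifier is updated to increase this entropy on unlabeled target samples and the feature extractor to decrease it.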
Funding: Supported by the National Natural Science Foundation of China (No. 51674032).
Abstract: The accuracy of laser-induced breakdown spectroscopy (LIBS) quantitative methods depends greatly on the number of certified standard samples used for training. In practical applications, however, only a limited number of standard samples with labeled certified concentrations are available. A novel semi-supervised LIBS quantitative analysis method is proposed, based on a co-training regression model with selection of effective unlabeled samples. The main idea is to obtain better regression performance by adding effective unlabeled samples in semi-supervised learning. First, effective unlabeled samples are selected according to their Euclidean distance to the testing samples. Two initial regression models based on least squares support vector machines with different parameters are trained separately on the labeled samples, and the effective unlabeled samples predicted by the two models are then used to enlarge the training dataset based on labeling confidence estimation. The final predictions on the testing samples are determined by weighted combinations of the predictions of the two updated regression models. Chromium concentration analysis experiments were carried out on 23 certified standard high-alloy steel samples, of which 5 samples with labeled concentrations and 11 unlabeled samples were used to train the regression models and the remaining 7 samples were used for testing. As the number of effective unlabeled samples increased, the root mean square error of the proposed method fell from 1.80% to 0.84% and the relative prediction error was reduced from 9.15% to 4.04%.
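Three of the steps named here can be sketched in a few lines: selecting unlabeled samples by Euclidean distance to the testing samples, combining two models' predictions by weights, and computing the root mean square error. The regression models themselves are omitted. The selection rule (distance to the nearest testing sample), the equal default weights, and all function names and toy values are assumptions rather than the paper's exact procedure.

```python
import math


def select_effective(unlabeled, test_samples, k):
    """Return the k unlabeled samples whose Euclidean distance to their
    nearest testing sample is smallest."""
    def nearest_test_distance(sample):
        return min(math.dist(sample, t) for t in test_samples)
    return sorted(unlabeled, key=nearest_test_distance)[:k]


def weighted_prediction(pred_a, pred_b, weight_a=0.5):
    """Combine two models' predictions; the weights sum to one."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(pred_a, pred_b)]


def rmse(predicted, truth):
    """Root mean square error between predictions and reference values."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)
    )


# Toy two-dimensional "spectra"; real LIBS spectra have many more channels.
chosen = select_effective([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]],
                          [[0.9, 0.0]], k=2)
combined = weighted_prediction([1.0, 3.0], [3.0, 5.0])
error = rmse(combined, [2.0, 2.0])
```

`math.dist` requires Python 3.8 or later.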
Funding: National Natural Science Foundation of China (Nos. 60873179 and 60803078).
Abstract: Chinese organization name recognition is a difficult and important task in natural language processing. To reduce the need for tagged corpora and to exploit untagged corpora, we combine Co-training with support vector machines (SVM) and conditional random fields (CRF) to improve recognition results. Based on the principles of uncorrelatedness and compatibility, we construct different classifiers from different views, using SVM alone, CRF alone, and a combination of the two models. We also modify a heuristic algorithm for selecting untagged samples to reduce time complexity. Experimental results show that, with the same tagged data, Co-training achieves an F-measure 10% higher than SVM or CRF alone; at the same F-measure, Co-training saves up to 70% of the tagged data.