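Abstract: This paper makes full use of the hyperlink relations and text information of web page data and proposes an inductive semi-supervised learning algorithm for web page classification: the graph-based Co-training algorithm for web page classification, GCo-training for short. The effectiveness of the algorithm is proved theoretically. Within the Co-training framework, GCo-training iteratively learns a semi-supervised classifier based on a graph constructed from hyperlink information and a Bayes classifier based on text features. The graph-based semi-supervised classifier uses only a small amount of labeled data and, by mining the large amount of relational information among the data, reaches fairly high prediction accuracy, so it can supply the Bayes classifier with a large amount of label information; conversely, the Bayes classifier, once trained on this label information, can supply useful information to the graph-based classifier. During the iterations the two classifiers help each other and steadily improve their respective performance, after which the Bayes classifier can be used to predict the classes of large amounts of unseen data. Experimental results on the Web→KB data set show that GCo-training outperforms both the Co-training algorithm that uses text features and anchor-text features and the EM-based Bayes algorithm.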
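The abstract does not say which graph-based classifier GCo-training uses. As a rough illustration of how a few labeled pages can spread label information over a hyperlink graph, the sketch below swaps in plain iterative label propagation with clamped seeds. The function name `propagate_labels`, the page names, and the +1/-1 class encoding are all hypothetical choices made for this example, not the authors' method.

```python
def propagate_labels(neighbors, seeds, iterations=50):
    """Spread seed labels over a link graph by repeated neighbor averaging.

    neighbors: {page: [linked pages]} (treated as undirected links here)
    seeds:     {page: +1.0 or -1.0} for the few labeled pages
    Returns a score per page; the sign of the score is the predicted class.
    """
    scores = {page: seeds.get(page, 0.0) for page in neighbors}
    for _ in range(iterations):
        updated = {}
        for page, linked in neighbors.items():
            if page in seeds:
                updated[page] = seeds[page]  # labeled pages stay clamped
            elif linked:
                updated[page] = sum(scores[m] for m in linked) / len(linked)
            else:
                updated[page] = 0.0  # isolated page: no evidence either way
        scores = updated
    return scores


# Toy chain of four pages with one labeled page at each end (made-up data).
links = {"p0": ["p1"], "p1": ["p0", "p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
scores = propagate_labels(links, {"p0": 1.0, "p3": -1.0})
print({page: round(s, 3) for page, s in scores.items()})
```

In a co-training setting, the pages whose scores are farthest from zero would be the natural candidates to hand to the text-based classifier as extra labeled data.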
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 20503015).
Abstract: Co-training is a semi-supervised learning method that employs two complementary learners to label unlabeled data for each other and to predict test samples jointly. Previous studies show that redundant information can help improve the ratio of prediction accuracy between semi-supervised learning methods and supervised learning methods. In practice, however, redundant information often hurts the performance of learning machines. This paper investigates which redundant features affect semi-supervised learning methods such as co-training, and how to remove the redundant features as well as the irrelevant ones. FESCOT (feature selection for co-training) is proposed to improve the generalization performance of co-training through feature selection. Experimental results on artificial and real-world data sets show that FESCOT helps remove irrelevant and redundant features that hurt the performance of co-training.
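The "two learners labeling data for each other" loop described above can be sketched in a few lines. This is a minimal illustration, assuming a nearest-centroid learner on each view and a confidence score equal to the gap between the two nearest centroids; the names `fit_centroids`, `predict`, `co_train` and the toy data are hypothetical, and feature selection (the FESCOT contribution) is not shown.

```python
import math


def fit_centroids(X, y):
    """Nearest-centroid learner: return {label: mean feature vector}."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, x):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    ranked = sorted((math.dist(x, c), label) for label, c in model.items())
    gap = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], gap


def fit_on_labeled(view, labels):
    """Train on the examples of one view that currently carry a label."""
    idx = [i for i, label in enumerate(labels) if label is not None]
    return fit_centroids([view[i] for i in idx], [labels[i] for i in idx])


def co_train(view1, view2, labels, rounds=5, per_round=1):
    """labels[i] is a class or None (unlabeled). Returns both models and the filled-in labels."""
    labels = list(labels)
    for _ in range(rounds):
        unlabeled = [i for i, label in enumerate(labels) if label is None]
        if not unlabeled:
            break
        proposed = {}
        for view in (view1, view2):
            model = fit_on_labeled(view, labels)
            scored = [(predict(model, view[i]), i) for i in unlabeled]
            scored.sort(key=lambda item: -item[0][1])  # most confident first
            for (label, _), i in scored[:per_round]:
                proposed.setdefault(i, label)  # each learner adds labels for both
        for i, label in proposed.items():
            labels[i] = label
    return fit_on_labeled(view1, labels), fit_on_labeled(view2, labels), labels


# Toy data: the same six examples seen through two one-dimensional views.
view1 = [[0.0], [0.2], [0.4], [1.0], [0.8], [0.6]]
view2 = [[5.0], [5.1], [5.3], [9.0], [8.8], [8.5]]
seed_labels = ["a", None, None, "b", None, None]
model1, model2, filled = co_train(view1, view2, seed_labels)
print(filled)
```

Each round, both learners train on the shared labeled pool and each contributes its most confident pseudo-label, so a view that separates the classes well can bootstrap a view that does not.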
Funding: Supported by the National Natural Science Foundation of China (No. 51674032).
Abstract: The accuracy of quantitative laser-induced breakdown spectroscopy (LIBS) depends heavily on the number of certified standard samples used for training. In practical applications, however, only a limited number of standard samples with labeled certified concentrations are available. A novel semi-supervised LIBS quantitative analysis method is proposed, based on a co-training regression model with selection of effective unlabeled samples. The main idea is to obtain better regression performance by adding effective unlabeled samples during semi-supervised learning. First, effective unlabeled samples are selected according to their Euclidean distance to the testing samples. Two initial regression models based on least squares support vector machines with different parameters are trained separately on the labeled samples, and the effective unlabeled samples predicted by the two models are then used to enlarge the training dataset based on labeling confidence estimation. The final predictions on the testing samples are determined by a weighted combination of the predictions of the two updated regression models. Chromium concentration analysis experiments were carried out on 23 certified standard high-alloy steel samples, of which 5 samples with labeled concentrations and 11 unlabeled samples were used to train the regression models and the remaining 7 samples were used for testing. As the number of effective unlabeled samples increased, the root mean square error of the proposed method fell from 1.80% to 0.84% and the relative prediction error fell from 9.15% to 4.04%.
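Three of the steps above are simple enough to sketch directly: picking unlabeled samples by Euclidean distance to the testing samples, combining two models' predictions with a weight, and computing the two reported error measures. The abstract gives no exact selection rule or weighting scheme, so the "k nearest to any test sample" rule, the single weight `w_a`, and all function names below are assumptions made for illustration; the least squares support vector machine regressors themselves are not shown.

```python
import math


def select_effective_unlabeled(unlabeled, test, k):
    """Indices of the k unlabeled spectra closest (Euclidean) to any test spectrum."""
    scored = [(min(math.dist(u, t) for t in test), i) for i, u in enumerate(unlabeled)]
    return [i for _, i in sorted(scored)[:k]]


def weighted_prediction(pred_a, pred_b, w_a):
    """Combine two models' predictions; w_a is the (assumed) weight of model A."""
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(pred_a, pred_b)]


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_relative_error(y_true, y_pred):
    """Mean of |true - predicted| / |true|."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up two-channel "spectra", only to exercise the functions.
unlabeled = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
test = [[1.0, 1.2], [0.9, 1.0]]
chosen = select_effective_unlabeled(unlabeled, test, k=2)
print(chosen)
```

Selecting unlabeled samples near the test samples keeps the enlarged training set focused on the region of feature space where predictions will actually be made.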
Funding: Supported in part by the National Natural Science Foundation of China (61933015) and in part by the Central University Basic Research Fund of China under Grant K20200002 (for NGICS Platform, Zhejiang University).
Abstract: Because fault samples are scarce and data fluctuate widely in the blast furnace (BF) ironmaking process, several transfer learning-based fault diagnosis methods have been proposed. The vast majority of these methods perform distribution adaptation by reducing the distance between data distributions and apply a classifier to generate pseudo-labels for self-training. However, since the training data are dominated by labeled source-domain data, such classifiers tend to be weak classifiers in the target domain. In addition, the features generated after domain adaptation are likely to lie near the decision boundary, resulting in a loss of classification performance. Hence, we propose a novel method called minimax entropy-based co-training (MMEC) that adversarially optimizes a transferable fault diagnosis model for the BF. The structure of MMEC includes a dual-view feature extractor, followed by two classifiers that compute the cosine similarity between a feature and the representative vector of each class. Knowledge transfer is achieved by alternately increasing and decreasing the entropy of unlabeled target samples with the classifier and the feature extractor, respectively. Transfer BF fault diagnosis experiments show that our method improves accuracy by about 5% over state-of-the-art methods.
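The two quantities that MMEC alternates on can be written down compactly: class probabilities from cosine similarity to one representative vector per class, and the entropy of those probabilities for an unlabeled target sample. The sketch below computes both. The softmax with a `temperature` parameter, the function names, and the toy vectors are assumptions for illustration; the adversarial update itself (classifier raising the entropy, feature extractor lowering it) requires a trainable network and is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_probabilities(feature, prototypes, temperature=0.05):
    """Softmax over cosine similarities to each class's representative vector."""
    logits = [cosine(feature, p) / temperature for p in prototypes]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two hypothetical class prototypes and two target-domain features.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
confident = entropy(class_probabilities([1.0, 0.0], prototypes))
ambiguous = entropy(class_probabilities([1.0, 1.0], prototypes))
print(confident, ambiguous)
```

A feature that sits between the prototypes has high entropy, which is the situation the abstract describes as features lying near the decision boundary; lowering that entropy through the feature extractor pushes target features toward one prototype.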
Funding: National Natural Science Foundation of China (No. 60873179, No. 60803078).
Abstract: Chinese organization name recognition is a difficult and important task in natural language processing. To reduce the need for tagged corpora and make use of untagged corpora, we combine Co-training with support vector machines (SVM) and conditional random fields (CRF) to improve recognition results. Following the principles of uncorrelatedness and compatibility, we construct different classifiers from different views, using SVM alone, CRF alone, and a combination of the two models. We also modify a heuristic selection algorithm for untagged samples to reduce time complexity. Experimental results show that, with the same tagged data, Co-training achieves an F-measure 10% higher than SVM or CRF alone; at the same F-measure, Co-training saves up to 70% of the tagged data.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 60525213 and 60776096); the National Basic Research Program of China (Grant No. 2006CB303106); the National High-Tech Research & Development Program of China (Grant Nos. 2007AA01Z236 and 2007AA01Z449); the Joint Funds of NSFC-Guangdong (Grant No. U0735001); and the National Project of Scientific and Technical Supporting Programs (Grant No. 2007BAH13B01).
Abstract: Classification of network traffic is an essential step in much network research. However, with the rapid evolution of Internet applications, the effectiveness of port-based and payload-based identification approaches has greatly diminished in recent years, and many researchers have begun to turn their attention to machine learning-based methods as an alternative. This paper presents a novel machine learning-based classification model that combines the ensemble learning paradigm with co-training techniques. Compared with previous approaches, most of which employ only a single classifier, our method applies multiple classifiers and semi-supervised learning, which mainly helps to overcome three shortcomings: limited flow accuracy rate, weak adaptability, and a heavy demand for labeled training data. Statistical characteristics of IP flows are extracted from packet-level traces to establish the feature set; the classification model is then created and tested, and the empirical results demonstrate its feasibility and effectiveness.
Abstract: Image sentiment classification, which aims to predict the polarity of the sentiment conveyed by images, has attracted considerable attention. Most existing methods address this problem by training a general classifier on certain visual features, ignoring the discrepancies across domains. In this paper, we propose a novel weighted co-training method for cross-domain image sentiment classification, which iteratively enlarges the labeled set by introducing newly classified high-confidence samples to reduce the gap between the two domains. We train two sentiment classifiers separately, one on the images and one on the corresponding textual comments, and set the similarity between the source domain and the target domain as the weight of a classifier. We perform extensive experiments on a real Flickr dataset to evaluate the proposed method, and the empirical study shows that the weighted co-training method significantly outperforms several baseline solutions.
Funding: This work was partially supported by the National University of Defense Technology Foundation under Grant Nos. ZK20-09 and ZK21-17, and the Natural Science Foundation of Hunan Province of China under Grant No. 2021JJ40692.
Abstract: Graph neural networks (GNNs) have achieved significant success in graph representation learning. Nevertheless, recent work indicates that current GNNs are vulnerable to adversarial perturbations, in particular structural perturbations. This narrows the application of GNN models in real-world scenarios. Such vulnerability can be attributed to the model's excessive reliance on incomplete data views (e.g., graph convolutional networks (GCNs) rely heavily on graph structures to make predictions). Integrating information from multiple perspectives can effectively address this problem; typical views of graphs include the node feature view and the graph structure view. In this paper, we propose C^(2)oG, which combines these two typical views to train sub-models and fuses their knowledge through co-training. Owing to the orthogonality of the views, sub-models in the feature view tend to be robust against perturbations targeted at sub-models in the structure view. C^(2)oG allows sub-models to correct one another and thus enhances the robustness of their ensembles. In our evaluations, C^(2)oG significantly improves the robustness of graph models against adversarial attacks without sacrificing their performance on clean datasets.
Funding: Project supported by the National Natural Science Foundation of China (No. 61672440); the Natural Science Foundation of Fujian Province, China (No. 2016J05161); the Research Fund of the State Key Laboratory for Novel Software Technology in Nanjing University, China (No. KFKT2015B11); the Scientific Research Project of the National Language Committee of China (No. YB135-49); and the Fundamental Research Funds for the Central Universities, China (No. ZK1024).
Abstract: A lack of labeled corpora obstructs research progress on implicit discourse relation recognition (DRR) for Chinese, while discourse corpora are available in other languages, such as English. In this paper, we propose a cross-lingual implicit DRR framework that exploits an available English corpus for the Chinese DRR task. We use machine translation to generate Chinese instances from a labeled English discourse corpus, so that each instance has two independent views: a Chinese view and an English view. We then train two classifiers, one in Chinese and one in English, in a co-training fashion, which exploits unlabeled Chinese data to achieve better implicit DRR for Chinese. Experimental results demonstrate the effectiveness of our method.
Abstract: Co-training is a well-known semi-supervised learning algorithm that can exploit unlabeled data to improve learning performance. It generally works under a two-view setting (the input examples naturally have two disjoint feature sets), with the assumption that each view is sufficient to predict the label. In real-world applications, however, feature corruption or feature noise may leave both views insufficient, and co-training suffers as a result. In this paper, we propose a novel algorithm named Weighted Co-training to deal with this problem. It identifies the newly labeled examples that are probably harmful to the other view and decreases their weights in the training set to avoid the risk. The experimental results show that Weighted Co-training performs better than state-of-the-art co-training algorithms on several benchmarks.
Funding: The National Natural Science Foundation of China (61761004) and the Natural Science Foundation of Guangxi Province, China (2019GXNSFAA245045).
Abstract: In a large-scale radio frequency identification (RFID) indoor positioning system, the positioning area is relatively large, labeled data are scarce while unlabeled data are plentiful, and the system is easily affected by multipath and white noise. An RFID positioning algorithm based on semi-supervised actor-critic co-training (SACC) is proposed to solve this problem. In this research, positioning is regarded as a Markov decision process. First, the actor-critic is combined with random actions, and the best unlabeled received signal strength indicator (RSSI) data are selected by semi-supervised co-training. Second, the actor and the critic are updated using the Kronecker-factored approximate curvature (K-FAC) natural gradient. Finally, the target position is obtained by co-locating with the labeled RSSI data and the selected unlabeled RSSI data. The proposed method significantly reduces the cost of indoor positioning by decreasing the amount of labeled data required. Meanwhile, as the number of positioning targets increases, the actor can quickly select unlabeled RSSI data and update the location model. Experiments show that, compared with other RFID indoor positioning algorithms, namely twin delayed deep deterministic policy gradient (TD3), deep deterministic policy gradient (DDPG), and actor-critic using Kronecker-factored trust region (ACKTR), the proposed method decreases the average positioning error by 50.226%, 41.916%, and 25.004%, respectively, and improves positioning stability by 23.430%, 28.518%, and 38.631%.