Funding: supported by the National Natural Science Foundation of China (No. 51674032)
Abstract: The accuracy of laser-induced breakdown spectroscopy (LIBS) quantitative methods depends greatly on the number of certified standard samples used for training. However, in practical applications, only limited standard samples with labeled certified concentrations are available. A novel semi-supervised LIBS quantitative analysis method is proposed, based on a co-training regression model with selection of effective unlabeled samples. The main idea of the proposed method is to obtain better regression performance by adding effective unlabeled samples in semi-supervised learning. First, effective unlabeled samples are selected according to the testing samples by the Euclidean metric. Two original regression models based on least squares support vector machines with different parameters are trained on the labeled samples separately, and the effective unlabeled samples predicted by the two models are then used to enlarge the training dataset based on labeling confidence estimation. The final predictions of the proposed method on the testing samples are determined by weighted combinations of the predictions of the two updated regression models. Chromium concentration analysis experiments on 23 certified standard high-alloy steel samples were carried out, in which 5 samples with labeled concentrations and 11 unlabeled samples were used to train the regression models and the remaining 7 samples were used for testing. As the number of effective unlabeled samples increased, the root mean square error of the proposed method went down from 1.80% to 0.84% and the relative prediction error was reduced from 9.15% to 4.04%.
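The selection-and-co-training loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: RBF kernel ridge regression stands in for the least-squares SVM, the sample counts (5 labeled, 11 unlabeled, 7 test) follow the abstract, and the agreement-based confidence rule and synthetic data are assumptions.

```python
import numpy as np

def select_effective_unlabeled(X_unl, X_test, n_select):
    """Pick the unlabeled samples closest (Euclidean) to the test set."""
    d = np.linalg.norm(X_unl[:, None, :] - X_test[None, :, :], axis=2)
    return np.argsort(d.min(axis=1))[:n_select]

def fit_krr(X, y, gamma, lam=1e-3):
    """RBF kernel ridge regression -- a simple stand-in for LS-SVM."""
    K = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=2))
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Z: np.exp(-gamma * np.sum((Z[:, None] - X[None, :]) ** 2, axis=2)) @ alpha

rng = np.random.default_rng(0)
X_lab = rng.uniform(-1, 1, (5, 2));  y_lab = X_lab.sum(axis=1)   # 5 labeled
X_unl = rng.uniform(-1, 1, (11, 2))                              # 11 unlabeled
X_test = rng.uniform(-1, 1, (7, 2))                              # 7 test

# Two base models with different kernel widths, trained on labeled data only.
m1, m2 = fit_krr(X_lab, y_lab, 0.5), fit_krr(X_lab, y_lab, 2.0)

# Select effective unlabeled samples near the test set, pseudo-label them,
# and keep the ones on which the two models agree (a crude confidence rule).
idx = select_effective_unlabeled(X_unl, X_test, 6)
p1, p2 = m1(X_unl[idx]), m2(X_unl[idx])
confident = np.abs(p1 - p2) <= np.median(np.abs(p1 - p2))
X_aug = np.vstack([X_lab, X_unl[idx][confident]])
y_aug = np.concatenate([y_lab, 0.5 * (p1 + p2)[confident]])

# Retrain both models on the enlarged set; final prediction is their average.
u1, u2 = fit_krr(X_aug, y_aug, 0.5), fit_krr(X_aug, y_aug, 2.0)
y_pred = 0.5 * (u1(X_test) + u2(X_test))
```

In the paper the pseudo-labels come with a labeling-confidence estimate; the median-agreement filter above is just one cheap proxy for that idea.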
Funding: supported by the DOD National Defense Science and Engineering Graduate (NDSEG) Research Fellowship, and by the NGA under Contract No. HM04762110003.
Abstract: Active learning in semi-supervised classification involves introducing additional labels for unlabelled data to improve the accuracy of the underlying classifier. A challenge is to identify which points to label to best improve performance while limiting the number of new labels. "Model Change" active learning quantifies the change incurred in the classifier by introducing the additional label(s). We pair this idea with graph-based semi-supervised learning (SSL) methods that use the spectrum of the graph Laplacian matrix, which can be truncated to avoid prohibitively large computational and storage costs. We consider a family of convex loss functions for which the acquisition function can be efficiently approximated using the Laplace approximation of the posterior distribution. We show a variety of multiclass examples that illustrate improved performance over the prior state of the art.
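A minimal sketch of the spectral-truncation idea: build a graph Laplacian, keep only its m smallest eigenpairs, and fit the label function in that truncated basis. The eigenvalue-weighted ridge fit and the toy two-blob data are assumptions standing in for the paper's actual SSL objective and model-change acquisition function.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two Gaussian blobs; one labeled node per class, the rest unlabeled.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.full(40, np.nan)
y[0], y[20] = 0.0, 1.0
labeled = ~np.isnan(y)

# Similarity graph and unnormalized Laplacian L = D - W.
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=2))
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Truncate the spectrum: keep only the m smallest eigenpairs. (On a large
# graph one would use a sparse eigensolver; full eigh suffices for this toy.)
m = 5
lam, V = np.linalg.eigh(L)
lam, V = lam[:m], V[:, :m]

# Fit the label function in the truncated eigenbasis with an
# eigenvalue-weighted ridge penalty (a stand-in for the full SSL objective).
Phi = V[labeled]
A = Phi.T @ Phi + np.diag(lam + 1e-6)
w = np.linalg.solve(A, Phi.T @ y[labeled])
f = V @ w                       # scores for every node
pred = (f > 0.5).astype(int)
```

The truncation means only an n x m eigenvector block is ever stored, which is what makes the Laplace-approximated acquisition function tractable on large graphs.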
Funding: under the auspices of the National Natural Science Foundation of China (No. 40671133) and the Fundamental Research Funds for the Central Universities (No. GK200902015)
Abstract: This paper proposed a semi-supervised regression model with a co-training algorithm based on support vector machines, which was used for retrieving water quality variables from SPOT 5 remote sensing data. The model consisted of two support vector regressors (SVRs). The nonlinear relationship between water quality variables and the SPOT 5 spectrum was described by the two SVRs, and a semi-supervised co-training algorithm for the SVRs was established. The model was used for retrieving concentrations of four representative pollution indicators: permanganate index (CODmn), ammonia nitrogen (NH3-N), chemical oxygen demand (COD), and dissolved oxygen (DO) of the Weihe River in Shaanxi Province, China. A spatial distribution map for those variables over a part of the Weihe River was also produced. SVR can readily implement any nonlinear mapping, and semi-supervised learning can make use of both labeled and unlabeled samples. By integrating the two SVRs and using semi-supervised learning, we provide an operational method for when paired samples are limited. The results show that the method is much better than multiple statistical regression, can quickly provide managers with an overview of water pollution conditions, and can be extended to hyperspectral remote sensing applications.
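A co-training loop between two regressors needs a confidence measure for pseudo-labeled points. The abstract does not spell its measure out, so the sketch below uses the COREG-style criterion (error reduction on the labeled set; Zhou & Li, 2005) as one common choice, with RBF kernel ridge regression standing in for SVR. All data here are synthetic.

```python
import numpy as np

def fit_rbf(X, y, gamma=1.0, lam=1e-3):
    """RBF kernel ridge regressor -- a light stand-in for SVR here."""
    K = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=2))
    a = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Z: np.exp(-gamma * np.sum((Z[:, None] - X[None, :]) ** 2, axis=2)) @ a

def coreg_confidence(X_lab, y_lab, x_new, y_new, gamma=1.0):
    """Error change on the labeled set when (x_new, y_new) is added.

    COREG-style criterion: positive => the pseudo-label looks helpful.
    """
    before = np.mean((fit_rbf(X_lab, y_lab, gamma)(X_lab) - y_lab) ** 2)
    X2 = np.vstack([X_lab, x_new])
    y2 = np.append(y_lab, y_new)
    after = np.mean((fit_rbf(X2, y2, gamma)(X_lab) - y_lab) ** 2)
    return before - after

rng = np.random.default_rng(2)
X_lab = rng.uniform(0, 1, (8, 1))
y_lab = np.sin(2.0 * X_lab[:, 0])
X_unl = rng.uniform(0, 1, (20, 1))

# Regressor 1 pseudo-labels the pool; regressor 2 would adopt the most
# confidently labeled point (and vice versa in the full co-training loop).
g1 = fit_rbf(X_lab, y_lab, gamma=1.0)
scores = [coreg_confidence(X_lab, y_lab, x, g1(x[None])[0]) for x in X_unl]
best = int(np.argmax(scores))
```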
Funding: supported by the National Natural Science Foundation of China (Grant No. 20503015).
Abstract: Co-training is a semi-supervised learning method which employs two complementary learners to label the unlabeled data for each other and to predict the test samples together. Previous studies show that redundant information can help improve the ratio of prediction accuracy between semi-supervised learning methods and supervised learning methods. In practice, however, redundant information often hurts the performance of learning machines. This paper investigates how redundant features affect semi-supervised learning methods such as co-training, and how to remove the redundant features as well as the irrelevant features. FESCOT (feature selection for co-training) is proposed to improve the generalization performance of co-training through feature selection. Experimental results on artificial and real-world data sets show that FESCOT helps to remove irrelevant and redundant features that hurt the performance of the co-training method.
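One simple way to realize the goal of dropping both irrelevant and redundant features is a greedy correlation filter; the thresholds and the filter itself are illustrative assumptions, not FESCOT's actual procedure.

```python
import numpy as np

def filter_features(X, y, rel_min=0.1, red_max=0.9):
    """Greedy correlation filter: drop irrelevant, then redundant features.

    A simplified illustration of FESCOT's goal; the paper's selection
    procedure may differ.
    """
    corr_y = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = []
    for j in np.argsort(-corr_y):          # most relevant first
        if corr_y[j] < rel_min:            # irrelevant: weak link to the target
            continue
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > red_max for k in keep):
            continue                       # redundant: near-duplicate of a kept feature
        keep.append(int(j))
    return sorted(keep)

rng = np.random.default_rng(3)
f1 = rng.normal(size=200)
X = np.column_stack([f1,
                     f1 + 0.01 * rng.normal(size=200),   # redundant copy of f1
                     rng.normal(size=200)])              # irrelevant noise
y = f1 + 0.1 * rng.normal(size=200)
kept = filter_features(X, y)
```

Because features 0 and 1 are near-duplicates, exactly one of them survives the redundancy check.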
Funding: supported by the National High Technology Research and Development Programme (No. 2007AA12Z227) and the National Natural Science Foundation of China (No. 40701146).
Abstract: The problems of nonlinearity, fuzziness, and scarce labeled data were rarely considered in traditional remote sensing image classification. A semi-supervised kernel fuzzy C-means (SSKFCM) algorithm is proposed in this paper to overcome these disadvantages in remote sensing image classification. The SSKFCM algorithm is obtained by introducing a kernel method and a semi-supervised learning technique into the standard fuzzy C-means (FCM) algorithm. A set of Beijing-1 micro-satellite multispectral images is classified by several algorithms: FCM, kernel FCM (KFCM), semi-supervised FCM (SSFCM), and SSKFCM. The classification results are evaluated with corresponding indexes. The results indicate that the SSKFCM algorithm significantly improves the classification accuracy of remote sensing images compared with the others.
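The membership and prototype updates of a kernelized, semi-supervised FCM can be sketched as below, with the memberships of labeled samples clamped to one-hot each iteration. The RBF kernelization d^2(x, v) = 2(1 - K(x, v)) and the labeled-sample prototype initialization are common choices assumed here, not necessarily the paper's exact formulation.

```python
import numpy as np

def sskfcm(X, labels, n_clusters=2, m=2.0, gamma=1.0, n_iter=50):
    """Semi-supervised kernel fuzzy C-means (simplified sketch).

    `labels` holds a class index, or -1 for unlabeled samples; each class
    is assumed to have at least one labeled sample (used to initialize
    its prototype).
    """
    V = np.array([X[labels == c].mean(axis=0) for c in range(n_clusters)])
    for _ in range(n_iter):
        K = np.exp(-gamma * np.sum((X[:, None] - V[None, :]) ** 2, axis=2))
        d2 = np.maximum(2.0 * (1.0 - K), 1e-12)            # kernel-space distances, (n, c)
        # Standard FCM membership update on the kernelized distances.
        U = 1.0 / np.sum((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0)), axis=2)
        lab = labels >= 0
        U[lab] = np.eye(n_clusters)[labels[lab]]           # clamp the supervision
        Wgt = (U ** m) * K                                 # prototype-update weights
        V = (Wgt.T @ X) / Wgt.sum(axis=0)[:, None]
    return U, V

# Two well-separated blobs, one labeled sample per class.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels = np.full(60, -1)
labels[0], labels[30] = 0, 1
U, V = sskfcm(X, labels)
pred = U.argmax(axis=1)
```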
Funding: the National Natural Science Foundation of China (61761004) and the Natural Science Foundation of Guangxi Province, China (2019GXNSFAA245045).
Abstract: For a large-scale radio frequency identification (RFID) indoor positioning system, the positioning scale is relatively large, labeled data are scarce relative to unlabeled data, and positioning is easily affected by multipath effects and white noise. An RFID positioning algorithm based on semi-supervised actor-critic co-training (SACC) was proposed to solve this problem. In this research, positioning is regarded as a Markov decision process. Firstly, the actor-critic method was combined with random actions, and the best unlabeled received signal strength indication (RSSI) data were selected by semi-supervised co-training. Secondly, the actor and the critic were updated by employing the Kronecker-factored approximate curvature (K-FAC) natural gradient. Finally, the target position was obtained by co-locating with the labeled RSSI data and the selected unlabeled RSSI data. The proposed method reduces the cost of indoor positioning significantly by decreasing the amount of labeled data needed. Meanwhile, as the number of positioning targets increases, the actor can quickly select unlabeled RSSI data and update the location model. Experiments show that, compared with other RFID indoor positioning algorithms, such as twin delayed deep deterministic policy gradient (TD3), deep deterministic policy gradient (DDPG), and actor-critic using Kronecker-factored trust region (ACKTR), the proposed method decreased the average positioning error by 50.226%, 41.916%, and 25.004%, respectively, while positioning stability was improved by 23.430%, 28.518%, and 38.631%, respectively.
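The final co-location step over an RSSI fingerprint map can be illustrated with weighted k-nearest neighbors; the actor-critic machinery that selects the unlabeled fingerprints is omitted, and the log-distance signal model and reader layout below are invented for the toy example.

```python
import numpy as np

def wknn_locate(rssi, fingerprints, positions, k=3):
    """Weighted k-nearest-neighbor co-location over an RSSI fingerprint map."""
    d = np.linalg.norm(fingerprints - rssi, axis=1)   # distance in signal space
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + 1e-6)                          # closer fingerprints weigh more
    return (w[:, None] * positions[nn]).sum(axis=0) / w.sum()

# Toy fingerprint map: RSSI from 3 readers falls off with log-distance
# (an assumed propagation model, ignoring multipath and noise).
readers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
grid = np.array([[x, y] for x in range(0, 11, 2) for y in range(0, 11, 2)], float)
fp = -np.array([[20.0 * np.log10(np.linalg.norm(p - r) + 1.0) for r in readers]
                for p in grid])

target = np.array([5.0, 5.0])
obs = -np.array([20.0 * np.log10(np.linalg.norm(target - r) + 1.0) for r in readers])
est = wknn_locate(obs, fp, grid, k=3)
```

In the paper's setting, `fingerprints` would mix the labeled RSSI data with the unlabeled RSSI selected by the actor-critic co-training.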
Funding: supported by the National Key Research and Development Program of China (No. 2016YFB0201305) and the National Natural Science Foundation of China (No. 61872376).
Abstract: Deep learning models have achieved state-of-the-art performance in named entity recognition (NER); the good performance, however, relies heavily on substantial amounts of labeled data. In some specific areas such as the medical, financial, and military domains, labeled data are very scarce, while unlabeled data are readily available. Previous studies have used unlabeled data to enrich word representations, but a large amount of entity information in unlabeled data is neglected, which may be beneficial to the NER task. In this study, we propose a semi-supervised method for NER tasks, which learns to create high-quality labeled data by applying a pre-trained module to filter out erroneous pseudo labels. Pseudo labels are automatically generated for unlabeled data and used as if they were true labels. Our semi-supervised framework includes three steps: constructing an optimal single neural model for a specific NER task, learning a module that evaluates pseudo labels, and creating new labeled data and improving the NER model iteratively. Experimental results on two English NER tasks and one Chinese clinical NER task demonstrate that our method further improves the performance of the best single neural model. Even when we use only pre-trained static word embeddings and do not rely on any external knowledge, our method achieves performance comparable to state-of-the-art models on the CoNLL-2003 and OntoNotes 5.0 English NER tasks.
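The filtering step, keeping only sentences whose pseudo labels score well, reduces to something like the sketch below. Scoring each sentence by its least-confident token is a simple stand-in for the paper's learned evaluation module, and the sentences and probabilities are invented.

```python
def filter_pseudo_labeled(sentences, tag_probs, threshold=0.9):
    """Keep pseudo-labeled sentences whose least-confident token is confident.

    `tag_probs[i][j]` is the model probability of the predicted tag for
    token j of sentence i; the minimum over tokens is the sentence score.
    """
    kept = []
    for sent, probs in zip(sentences, tag_probs):
        if min(probs) >= threshold:
            kept.append(sent)
    return kept

sents = [["John", "lives", "in", "Paris"], ["Bank", "of", "???"]]
probs = [[0.99, 0.98, 0.97, 0.95], [0.92, 0.60, 0.51]]
clean = filter_pseudo_labeled(sents, probs)
```

The surviving sentences would then be added to the training set and the NER model retrained, iterating the three steps the abstract lists.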
Funding: partially supported by the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG), and by a scholarship of the German Academic Exchange Service (DAAD).
Abstract: Many data mining applications have large amounts of data, but labeling the data is usually difficult, expensive, or time consuming, as it requires human experts for annotation. Semi-supervised learning addresses this problem by using unlabeled data together with labeled data in the training process. Co-training is a popular semi-supervised learning algorithm that assumes each example is represented by multiple sets of features (views) and that these views are sufficient for learning and independent given the class. However, these assumptions are strong and are not satisfied in many real-world domains. In this paper, a single-view variant of co-training, called Co-Training by Committee (CoBC), is proposed, in which an ensemble of diverse classifiers is used instead of redundant and independent views. We introduce a new labeling confidence measure for unlabeled examples based on estimating the local accuracy of the committee members on the example's neighborhood. We then introduce two new learning algorithms, QBC-then-CoBC and QBC-with-CoBC, which combine the merits of committee-based semi-supervised learning and active learning. The random subspace method is applied to both C4.5 decision trees and 1-nearest-neighbor classifiers to construct the diverse ensembles used for semi-supervised learning and active learning. Experiments show that these two combinations can outperform other non-committee-based ones.
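CoBC's labeling confidence, a committee vote weighted by each member's local accuracy near the query, can be sketched with a random-subspace ensemble of 1-NN classifiers (one of the abstract's two base learners). The exact weighting below is a simplification, and the data are synthetic.

```python
import numpy as np

def nn1_predict(X_tr, y_tr, X):
    """Plain 1-nearest-neighbor prediction."""
    d = np.linalg.norm(X[:, None] - X_tr[None, :], axis=2)
    return y_tr[d.argmin(axis=1)]

def cobc_confidence(x, committee, subspaces, X_lab, y_lab, k=3):
    """Committee vote on x, weighted by each member's local accuracy.

    Each member is a 1-NN classifier on a random feature subspace; its
    weight is its accuracy on the k labeled points nearest to x. This is
    a simplified version of CoBC's neighborhood-based confidence.
    """
    nbr = np.argsort(np.linalg.norm(X_lab - x, axis=1))[:k]
    votes = {}
    for (Xs, ys), cols in zip(committee, subspaces):
        pred = nn1_predict(Xs, ys, x[None, cols])[0]
        acc = np.mean(nn1_predict(Xs, ys, X_lab[nbr][:, cols]) == y_lab[nbr])
        votes[pred] = votes.get(pred, 0.0) + acc
    label = max(votes, key=votes.get)
    return label, votes[label] / sum(votes.values())

rng = np.random.default_rng(5)
X_lab = np.vstack([rng.normal(0, 0.5, (10, 4)), rng.normal(3, 0.5, (10, 4))])
y_lab = np.array([0] * 10 + [1] * 10)

# Random-subspace committee: each member sees 2 of the 4 features.
subspaces = [rng.choice(4, 2, replace=False) for _ in range(5)]
committee = [(X_lab[:, c], y_lab) for c in subspaces]
label, conf = cobc_confidence(np.full(4, 3.0), committee, subspaces, X_lab, y_lab)
```

Unlabeled examples whose confidence exceeds a threshold would be pseudo-labeled and handed to the other committee members for retraining.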
Abstract: Co-training is a well-known semi-supervised learning algorithm that exploits unlabeled data to improve learning performance. Generally it works under a two-view setting (the input examples have two naturally disjoint feature sets), with the assumption that each view is sufficient to predict the label. However, in real-world applications, both views may be insufficient due to feature corruption or feature noise, and co-training will suffer from these insufficient views. In this paper, we propose a novel algorithm named Weighted Co-training to deal with this problem. It identifies the newly labeled examples that are probably harmful for the other view and decreases their weights in the training set to avoid the risk. The experimental results show that Weighted Co-training performs better than state-of-the-art co-training algorithms on several benchmarks.
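The down-weighting idea can be illustrated by scoring each newly labeled example by how much it increases the peer view's error on the labeled set. The exp(-harm) weight and the linear base learner below are assumptions for the sketch, not the paper's actual rule.

```python
import numpy as np

def example_weights(X_lab, y_lab, X_new, y_new, fit, err):
    """Down-weight newly labeled examples that look harmful to the peer view.

    Harm of example i = increase in the peer's labeled-set error when the
    example is added to its training set; weight = exp(-max(harm, 0)).
    """
    base = err(fit(X_lab, y_lab), X_lab, y_lab)
    w = np.empty(len(X_new))
    for i in range(len(X_new)):
        X2 = np.vstack([X_lab, X_new[i]])
        y2 = np.append(y_lab, y_new[i])
        harm = err(fit(X2, y2), X_lab, y_lab) - base
        w[i] = np.exp(-max(harm, 0.0))
    return w

# Ordinary least squares as the base learner for one view.
fit = lambda X, y: np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]
err = lambda b, X, y: np.mean((np.c_[X, np.ones(len(X))] @ b - y) ** 2)

rng = np.random.default_rng(7)
X_lab = rng.uniform(-1, 1, (10, 2))
y_lab = X_lab @ np.array([1.0, -2.0]) + 0.5
X_new = rng.uniform(-1, 1, (2, 2))
y_new = np.array([X_new[0] @ np.array([1.0, -2.0]) + 0.5,  # consistent pseudo-label
                  10.0])                                   # grossly wrong pseudo-label
w = example_weights(X_lab, y_lab, X_new, y_new, fit, err)
```

The consistent pseudo-label keeps full weight, while the wrong one is discounted before it can distort the peer's next round of training.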
Abstract: We advance here a novel methodology for robust intelligent biometric information management, with inferences and predictions made using randomness and complexity concepts. Intelligence refers to learning, adaptation, and functionality, and robustness refers to the ability to handle incomplete and/or corrupt adversarial information, on one side, and image and/or device variability, on the other side. The proposed methodology is model-free and non-parametric. It draws support from discriminative methods using likelihood ratios to link biometrics and forensics at the conceptual level. It further links, at the modeling and implementation level, the Bayesian framework, statistical learning theory (SLT) using transduction and semi-supervised learning, and information theory (IT) using mutual information. The key concepts supporting the proposed methodology are a) local estimation to facilitate learning and prediction using both labeled and unlabeled data; b) similarity metrics using regularity of patterns, randomness deficiency, and Kolmogorov complexity (similar to MDL), using strangeness/typicality and ranking p-values; and c) the Cover-Hart theorem on the asymptotic performance of k-nearest neighbors approaching the optimal Bayes error. Several topics on biometric inference and prediction related to 1) multi-level and multi-layer data fusion, including quality and multi-modal biometrics; 2) score normalization and revision theory; 3) face selection and tracking; and 4) identity management are described here using an integrated approach that includes transduction and boosting for ranking and sequential fusion/aggregation, respectively, on one side, and active learning and change/outlier/intrusion detection realized using information gain and martingale, respectively, on the other side. The proposed methodology can be mapped to additional types of information beyond biometrics.
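The strangeness and p-value machinery in item b) has a standard k-NN form (as in transductive confidence machines); the sketch below flags a mislabeled point by its small p-value. The data, k, and the specific strangeness ratio are illustrative choices, not the paper's exact definitions.

```python
import numpy as np

def strangeness(X, y, i, k=3):
    """k-NN strangeness: same-class closeness over other-class closeness.

    alpha = (sum of k smallest same-class distances)
          / (sum of k smallest other-class distances);
    small alpha means the point sits comfortably inside its class.
    """
    d = np.linalg.norm(X - X[i], axis=1)
    same = np.sort(d[(y == y[i]) & (np.arange(len(X)) != i)])[:k]
    other = np.sort(d[y != y[i]])[:k]
    return same.sum() / other.sum()

def p_value(X, y, i, k=3):
    """Rank-based p-value: fraction of points at least as strange as point i."""
    alphas = np.array([strangeness(X, y, j, k) for j in range(len(X))])
    return np.mean(alphas >= alphas[i])

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.4, (15, 2)), rng.normal(3, 0.4, (15, 2))])
y = np.array([0] * 15 + [1] * 15)
X = np.vstack([X, [[3.0, 3.0]]])   # a point deep inside class 1 ...
y = np.append(y, 0)                # ... deliberately mislabeled as class 0
```

Typical points receive large p-values, while the mislabeled outlier ranks as strangest, which is the basis for the change/outlier detection mentioned in the abstract.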