Funding: Project (20121101004) supported by the Major Science and Technology Program of Shanxi Province, China; Project (20130321004-01) supported by the Key Technologies R&D Program of Shanxi Province, China; Project (2013M530896) supported by the Postdoctoral Science Foundation of China; Project (2014021022-6) supported by the Shanxi Provincial Science Foundation for Youths, China; Project (80010302010053) supported by the Shanxi Characteristic Discipline Fund, China.
Abstract: Ensemble learning is a widely studied topic. Traditional ensemble techniques seek better results from labeled data and base classifiers, but they fail to address the ensemble task where only unlabeled data are available. A label propagation based ensemble (LPBE) approach is proposed to further combine base classification results with unlabeled data. First, a graph is constructed by taking the unlabeled data as vertices, and the edge weights are calculated by a correntropy function. Average prediction results are obtained from the base classifiers, then propagated under a regularization framework and adaptively enhanced over the graph. The approach is further extended to the case where a small amount of labeled data is available. The proposed algorithms are evaluated on several UCI benchmark data sets. Simulation results show that they achieve satisfactory performance compared with existing ensemble methods.
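The propagation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' LPBE implementation: the Gaussian kernel stands in for the correntropy-based weights, the update rule is a generic regularized label propagation iteration, the adaptive enhancement step is omitted, and the names `gaussian_weights`, `propagate`, `alpha`, and `sigma` are hypothetical.

```python
import math


def gaussian_weights(points, sigma=1.0):
    """Pairwise Gaussian-kernel edge weights with a zero diagonal (assumed kernel)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def propagate(points, avg_probs, alpha=0.8, sigma=1.0, iters=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y0 over the graph.

    S is the row-normalised weight matrix and Y0 holds the averaged
    base-classifier probabilities for each unlabeled point.
    """
    w = gaussian_weights(points, sigma)
    s = [[v / (sum(row) or 1.0) for v in row] for row in w]
    n, k = len(points), len(avg_probs[0])
    f = [list(p) for p in avg_probs]
    for _ in range(iters):
        f = [
            [
                alpha * sum(s[i][j] * f[j][c] for j in range(n))
                + (1.0 - alpha) * avg_probs[i][c]
                for c in range(k)
            ]
            for i in range(n)
        ]
    return f


# Two well-separated groups of unlabeled points (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
# Averaged base-classifier outputs; the third and sixth points disagree
# with their neighbours.
avg_probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8], [0.6, 0.4]]

scores = propagate(points, avg_probs)
labels = [max(range(2), key=row.__getitem__) for row in scores]
print(labels)
```

In this toy run the two points whose averaged prediction disagrees with their neighbours are pulled toward the neighbourhood consensus, which is the intuition behind propagating ensemble outputs over a similarity graph.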
Abstract: Identifying and correcting grammatical errors in text written by non-native writers has received increasing attention in recent years. Although a number of annotated corpora have been established to facilitate data-driven grammatical error detection and correction, they remain limited in quantity and coverage because human annotation is labor-intensive, time-consuming, and expensive. In this work, we propose to use unlabeled data to train neural network based grammatical error detection models. The basic idea is to cast error detection as a binary classification problem and derive positive and negative training examples from unlabeled data. We introduce an attention-based neural network to capture long-distance dependencies that influence the word being examined. Experiments show that the proposed approach significantly outperforms SVMs and convolutional networks with a fixed-size context window.
Funding: Project (No. 5959438) supported by Microsoft (China) Co., Ltd.
Abstract: In standard canonical correlation analysis (CCA), data from definite datasets are used to estimate their canonical correlation. In real applications, for example bilingual text retrieval, there may be a large portion of data for which we do not know which set it belongs to. This portion is called unlabeled data, while the rest, drawn from definite datasets, is called labeled data. We propose a novel method called regularized canonical correlation analysis (RCCA), which makes use of both labeled and unlabeled samples. Specifically, we learn to approximate the canonical correlation as if all data were labeled. We then describe a generalization of RCCA to the multi-set situation. Experiments on four real-world datasets, Yeast, Cloud, Iris, and Haberman, demonstrate that incorporating the unlabeled data points can improve the accuracy of the correlation coefficients by over 30%.
Funding: Supported by the National Key Research and Development Project (2018YFB2100801), in part by the National Natural Science Foundation of China (61972080), in part by the Shanghai Rising-Star Program (19QA1400300), and in part by the Open Research Project from the Key Laboratory of the Ministry of Education for Embedded System and Service Computing (ESSCKF2021-01).
Abstract: Large numbers of smart sensors (e.g., road-side cameras) that communicate with nearby base stations could launch distributed denial of service (DDoS) attack storms in intelligent transportation systems. DDoS attacks disable the services provided by base stations. In this paper, considering uneven communication traffic flows and privacy preservation, we give a hidden Markov model-based prediction model that exploits the multi-step characteristic of DDoS within a federated learning framework to predict whether DDoS attacks will happen on base stations in the future. However, federated learning is exposed to poisoning attacks by malicious participants, and without security protection such attacks can paralyze the intelligent transportation system. Traditional poisoning attacks mainly apply to classification models with labeled data. We propose a reinforcement learning-based poisoning method specifically for poisoning the prediction model with unlabeled data. Moreover, previous defense strategies rely on validation datasets with labeled data on the server. This is unrealistic, since the local training datasets are not uploaded to the server for privacy reasons, and our datasets are also unlabeled. We therefore give a validation dataset-free defense strategy based on Dempster-Shafer (D-S) evidence theory that avoids anomalous aggregation to obtain a robust global model for precise DDoS prediction. In our experiments, we simulate 3000 points in combination with the DARPA2000 dataset to carry out evaluations. The results indicate that our poisoning method can successfully poison the global prediction model with unlabeled data in a short time. We also compare our proposed defense algorithm with three widely used defense algorithms. The results show that our defense method excludes poisoners with high accuracy and achieves a high attack prediction probability.
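The defense in this abstract builds on Dempster-Shafer evidence theory. As a minimal sketch, the following shows only Dempster's standard rule of combination for two mass functions. How the paper derives masses from participants' model updates is not stated in the abstract, so the frame `{"honest", "poisoner"}`, the example masses, and the function name `combine` are illustrative assumptions.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset (a subset of the frame of
    discernment) to its belief mass; the masses in each sum to 1.
    """
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
            else:
                # Mass assigned to contradictory hypotheses.
                conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalise by the non-conflicting mass.
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


# Hypothetical frame: is a given participant honest or a poisoner?
H = frozenset({"honest"})
P = frozenset({"poisoner"})
U = H | P  # "unknown": mass not committed to either hypothesis

# Two hypothetical pieces of evidence about the same participant.
evidence_a = {H: 0.6, P: 0.1, U: 0.3}
evidence_b = {H: 0.5, P: 0.2, U: 0.3}

fused = combine(evidence_a, evidence_b)
print({tuple(sorted(k)): round(v, 4) for k, v in fused.items()})
```

Here the conflicting mass is 0.6 × 0.2 + 0.1 × 0.5 = 0.17, so the fused belief in "honest" is 0.63 / 0.83, higher than either source alone: two sources that agree reinforce each other.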
Funding: This work was supported by the National Research Foundation of Korea (No. 2020R1A2C1014829) and by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT; the Ministry of Trade, Industry and Energy; the Ministry of Health and Welfare; and the Ministry of Food and Drug Safety) (grant KMDF_PR_20200901_0095).
Abstract: In practical classification problems, one of the main challenges is obtaining enough labeled data for training. Moreover, even when sufficient labeled data has been accumulated, most datasets exhibit a long-tailed distribution with heavy class imbalance, which biases the model towards the majority class. To alleviate such class imbalance, semi-supervised learning methods that use additional unlabeled data have been considered. However, their accuracy is typically much lower than that of supervised learning. In this study, under the assumption that additional unlabeled data is available, we propose iterative semi-supervised learning algorithms that iteratively correct the labels of the extra unlabeled data based on softmax probabilities. The results show that the proposed algorithms achieve accuracy as high as that of supervised learning. To validate the proposed algorithms, we tested two scenarios: a balanced unlabeled dataset and an imbalanced unlabeled dataset. In both scenarios, the proposed semi-supervised learning algorithms provided higher accuracy than previous state-of-the-art methods. Code is available at https://github.com/HeewonChung92/iterative-semi-learning.
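The idea of iteratively correcting pseudo labels from softmax probabilities can be sketched in a few lines. This is not the code from the repository linked above: a nearest-centroid classifier stands in for the neural network, and the confidence `threshold`, the number of `rounds`, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def centroids(points, labels, n_classes):
    """Per-class mean; assumes every class has at least one training point."""
    dim = len(points[0])
    cents = []
    for c in range(n_classes):
        members = [p for p, y in zip(points, labels) if y == c]
        cents.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
    return cents


def predict_proba(cents, x):
    """Softmax over negative squared distances to each class centroid."""
    return softmax([-sum((a - b) ** 2 for a, b in zip(x, c)) for c in cents])


def iterative_pseudo_label(labeled, labels, unlabeled, n_classes,
                           threshold=0.9, rounds=5):
    """Re-estimate pseudo labels until they stop changing.

    A point keeps a pseudo label only while the current model's softmax
    confidence for it is at least `threshold`; otherwise it is set to None.
    """
    pseudo = [None] * len(unlabeled)
    for _ in range(rounds):
        train_x = labeled + [x for x, y in zip(unlabeled, pseudo) if y is not None]
        train_y = labels + [y for y in pseudo if y is not None]
        cents = centroids(train_x, train_y, n_classes)
        new_pseudo = []
        for x in unlabeled:
            probs = predict_proba(cents, x)
            best = max(range(n_classes), key=probs.__getitem__)
            new_pseudo.append(best if probs[best] >= threshold else None)
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
    return pseudo


# Toy 1-D data: one labeled point per class and five unlabeled points.
labeled = [(0.0,), (4.0,)]
labels = [0, 1]
unlabeled = [(0.5,), (1.0,), (3.0,), (3.5,), (2.0,)]

pseudo = iterative_pseudo_label(labeled, labels, unlabeled, n_classes=2)
print(pseudo)
```

The point at 2.0 sits midway between the classes, so its softmax confidence stays near 0.5 and it never receives a pseudo label, while the other four are labeled in the first round and confirmed in the second.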
Abstract: An unsupervised clustering-based intrusion detection algorithm is discussed in this paper. The basic idea of the algorithm is to form clusters by comparing the distances between instances in unlabeled training data sets. Once the instances are clustered, anomalous clusters can be identified by the normal cluster ratio, and the identified clusters can then be used to detect intrusions in real data. The benefit of the algorithm is that it does not need labeled training data sets. Experiments on the KDD99 data sets show that this approach can efficiently detect unknown intrusions in real network connections.
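One way to read the procedure in this abstract is sketched below. The abstract does not specify the clustering rule, so the single-pass fixed-width clustering, the interpretation of the normal cluster ratio as "the largest clusters covering this fraction of the data are normal", and the names `fixed_width_clusters`, `label_clusters`, and `detect` are all assumptions.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fixed_width_clusters(points, width):
    """Single pass: join the first cluster within `width`, else start a new one."""
    centres, counts = [], []
    for p in points:
        for i, c in enumerate(centres):
            if dist(p, c) <= width:
                counts[i] += 1
                break
        else:
            centres.append(p)
            counts.append(1)
    return centres, counts


def label_clusters(counts, normal_ratio):
    """Mark the largest clusters as normal until they cover `normal_ratio` of the data."""
    total = sum(counts)
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    is_normal = [False] * len(counts)
    covered = 0
    for i in order:
        if covered / total >= normal_ratio:
            break
        is_normal[i] = True
        covered += counts[i]
    return is_normal


def detect(x, centres, is_normal):
    """Classify a new instance by the label of its nearest cluster centre."""
    nearest = min(range(len(centres)), key=lambda i: dist(x, centres[i]))
    return "normal" if is_normal[nearest] else "anomaly"


# Toy unlabeled training data: eight similar records and two outlying ones.
train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.0), (0.0, 0.3),
         (0.2, 0.2), (0.1, 0.1), (0.3, 0.3), (10.0, 10.0), (10.1, 9.9)]

centres, counts = fixed_width_clusters(train, width=1.0)
is_normal = label_clusters(counts, normal_ratio=0.8)
print(counts, is_normal)
print(detect((0.2, 0.1), centres, is_normal), detect((9.8, 10.1), centres, is_normal))
```

The sketch relies on the usual assumption behind this family of methods: normal traffic vastly outnumbers attacks, so small clusters are the suspicious ones.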
Abstract: Geoscientists are increasingly tasked with spatially predicting a target variable in the presence of auxiliary information using supervised machine learning algorithms. Typically, the target variable is observed at only a few sampling locations because obtaining measurements is time-consuming and costly. In contrast, auxiliary variables are often exhaustively observed within the region under study thanks to the growing availability of remote sensing platforms and sensor networks. Supervised machine learning methods do not fully leverage this large amount of auxiliary spatial data. In these methods, the training dataset includes only labeled data locations (where both target and auxiliary variables were measured), while unlabeled data locations (where auxiliary variables were measured but not the target variable) are not considered during model training. Consequently, only a limited amount of auxiliary spatial data is used at the training stage. As an alternative to supervised learning, semi-supervised learning, which learns from labeled as well as unlabeled data, can be used to address this problem. However, conventional semi-supervised learning techniques do not account for the specificities of spatial data. This paper introduces a spatial semi-supervised learning framework in which geostatistics and machine learning are combined to harness a large amount of unlabeled spatial data together with a typically smaller set of labeled spatial data. The main idea is to leverage the target variable’s spatial autocorrelation to generate pseudo labels at unlabeled data points that are geographically close to labeled data points. This is achieved through geostatistical conditional simulation, in which an ensemble of pseudo labels is generated to account for the uncertainty in the pseudo labeling process. The observed labels are augmented by this ensemble of pseudo labels to create an ensemble of pseudo training datasets. A supervised machine learning model is then trained on each pseudo training dataset, and the trained models are aggregated. The proposed geostatistical semi-supervised learning method is applied to synthetic and real-world spatial datasets, and its predictive performance is compared with that of classical supervised and semi-supervised machine learning methods. The results indicate that it can effectively leverage a large amount of unlabeled spatial data to improve the spatial prediction of the target variable.
Funding: Project supported by the National Key Research and Development Program of China (No. 2016YFB0201305) and the National Natural Science Foundation of China (No. 61872376).
Abstract: Deep learning models have achieved state-of-the-art performance in named entity recognition (NER); this performance, however, relies heavily on substantial amounts of labeled data. In some specific areas, such as the medical, financial, and military domains, labeled data is very scarce, while unlabeled data is readily available. Previous studies have used unlabeled data to enrich word representations, but they neglect a large amount of entity information in unlabeled data that may be beneficial to the NER task. In this study, we propose a semi-supervised method for NER tasks, which learns to create high-quality labeled data by applying a pre-trained module to filter out erroneous pseudo labels. Pseudo labels are automatically generated for unlabeled data and used as if they were true labels. Our semi-supervised framework includes three steps: constructing an optimal single neural model for a specific NER task, learning a module that evaluates pseudo labels, and iteratively creating new labeled data and improving the NER model. Experimental results on two English NER tasks and one Chinese clinical NER task demonstrate that our method further improves the performance of the best single neural model. Even when we use only pre-trained static word embeddings and do not rely on any external knowledge, our method achieves performance comparable to state-of-the-art models on the CoNLL-2003 and OntoNotes 5.0 English NER tasks.