Abstract: In medical imaging, pixel-level labels are time-consuming and expensive to acquire, while image-level labels are relatively easy to obtain. It therefore makes sense to extract as much information (knowledge) as possible from a small number of hard-to-get pixel-level annotated images and to apply it to different tasks, maximizing their usefulness and saving time and training costs. In this paper, using Pixel-Level Labeled Images for Multi-Task Learning (PLDMLT), we focus on grading the severity of Diabetic Retinopathy (DR) in fundus images. The motivation is that the segmentation task has finely labeled masks, while the severity grading task has no classification labels. To this end, we propose a two-stage, multi-label, weakly supervised learning algorithm. The first stage generates initial classification pseudo labels and uses Grad-CAM to visualize heat maps at all levels of severity, which further provides medical interpretability for the classification task. The second stage proposes a multi-task model framework with U-Net as the baseline. A label update network is designed to alleviate the gradient imbalance between the classification and segmentation tasks. Extensive experimental results show that our PLDMLT method significantly outperforms other state-of-the-art methods in DR segmentation on two public datasets, achieving up to 98.897% segmentation accuracy. In addition, our method is competitive with single-task fully supervised learning on the DR severity grading task.
Abstract: Geoscientists are increasingly tasked with spatially predicting a target variable in the presence of auxiliary information using supervised machine learning algorithms. Typically, the target variable is observed at only a few sampling locations because obtaining measurements is relatively time-consuming and costly. In contrast, auxiliary variables are often exhaustively observed within the region under study, thanks to the growing development of remote sensing platforms and sensor networks. Supervised machine learning methods do not fully leverage this large amount of auxiliary spatial data. In these methods, the training dataset includes only labeled data locations (where both target and auxiliary variables were measured), while unlabeled data locations (where auxiliary variables were measured but not the target variable) are not considered during model training. Consequently, only a limited amount of auxiliary spatial data is used at the training stage. As an alternative to supervised learning, semi-supervised learning, which learns from labeled as well as unlabeled data, can be used to address this problem. However, conventional semi-supervised learning techniques do not account for the specificities of spatial data. This paper introduces a spatial semi-supervised learning framework in which geostatistics and machine learning are combined to harness a large amount of unlabeled spatial data together with a typically smaller set of labeled spatial data. The main idea is to leverage the target variable's spatial autocorrelation to generate pseudo labels at unlabeled data points that are geographically close to labeled data points. This is achieved through geostatistical conditional simulation, where an ensemble of pseudo labels is generated to account for the uncertainty in the pseudo-labeling process. The observed labels are augmented by this ensemble of pseudo labels to create an ensemble of pseudo training datasets. A supervised machine learning model is then trained on each pseudo training dataset, and the trained models are aggregated. The proposed geostatistical semi-supervised learning method is applied to synthetic and real-world spatial datasets, and its predictive performance is compared with that of some classical supervised and semi-supervised machine learning methods. The results indicate that the method can effectively leverage a large amount of unlabeled spatial data to improve the spatial prediction of the target variable.
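The train-on-each-pseudo-dataset-then-aggregate loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pseudo labels are drawn by a deliberately simple stand-in (nearest labeled value plus distance-scaled Gaussian noise) rather than by geostatistical conditional simulation, locations are one-dimensional, the supervised model is a one-variable least-squares line, and all function and parameter names (`draw_pseudo_labels`, `ensemble_predict`, `radius`, `noise`) are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b * x on one auxiliary variable."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def draw_pseudo_labels(labeled, unlabeled, radius, noise, rng):
    """Assumed stand-in for conditional simulation (NOT the real technique).

    labeled:   list of (location, aux, target)
    unlabeled: list of (location, aux)
    An unlabeled point within `radius` of a labeled point receives that
    point's target plus Gaussian noise that grows with distance.
    """
    pseudo = []
    for loc, aux in unlabeled:
        nearest = min(labeled, key=lambda p: abs(p[0] - loc))
        dist = abs(nearest[0] - loc)
        if dist <= radius:
            pseudo.append((loc, aux, nearest[2] + rng.gauss(0.0, noise * dist)))
    return pseudo


def ensemble_predict(labeled, unlabeled, query_aux,
                     n_members=20, radius=1.0, noise=0.5, seed=0):
    """Train one model per pseudo training dataset and average the predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_members):
        # Observed labels augmented by one realization of pseudo labels.
        train = labeled + draw_pseudo_labels(labeled, unlabeled, radius, noise, rng)
        a, b = fit_line([p[1] for p in train], [p[2] for p in train])
        predictions.append(a + b * query_aux)
    return mean(predictions)


labeled = [(0.0, 1.0, 2.0), (10.0, 3.0, 6.0)]
unlabeled = [(0.5, 1.2), (9.5, 2.8), (5.0, 2.0)]
print(ensemble_predict(labeled, unlabeled, query_aux=2.0))
```

Averaging predictions is only one possible aggregation; the abstract does not say which aggregation rule the paper uses.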
Funding: Supported by the National Natural Science Foundation of China (62076031).
Abstract: As a subtask of open domain event extraction (ODEE), new event type induction aims to discover a set of unseen event types from a given corpus. Existing methods mostly adopt semi-supervised or unsupervised learning to achieve this goal, using complex and distinct objective functions for labeled and unlabeled data, respectively. To unify and simplify the objective functions, a reliable pseudo-labeling prediction (RPP) framework for new event type induction is proposed. The framework introduces a double label reassignment (DLR) strategy for unlabeled data based on swap-prediction. The DLR strategy can alleviate the model degeneration caused by swap-prediction and further incorporates the real distribution over unseen event types to produce more reliable pseudo labels for unlabeled data. The generated reliable pseudo labels allow the overall model to be optimized with a unified and simple objective. Experiments show that the RPP framework outperforms state-of-the-art methods on the benchmark.
Funding: Project supported by the National Science and Technology Major Project (No. 2022ZD0115302), the National Natural Science Foundation of China (No. 61379052), the Science Foundation of Ministry of Education of China (No. 2018A02002), and the Natural Science Foundation for Distinguished Young Scholars of Hunan Province, China (No. 14JJ1026).
Abstract: Active anomaly detection queries labels of sampled instances and uses them to incrementally update the detection model, and it has been widely adopted in detecting network attacks. However, existing methods cannot achieve desirable performance on dynamic network traffic streams because (1) their query strategies cannot sample informative instances that let the detection model adapt to the evolving stream, and (2) their model updating relies only on the limited queried instances and fails to leverage the enormous number of unlabeled instances in the stream. To address these issues, we propose an active tree-based model, the adaptive and augmented active prior-knowledge forest (A3PF), for anomaly detection on network traffic streams. A prior-knowledge forest is constructed using prior knowledge of network attacks to find feature subspaces that better distinguish network anomalies from normal traffic. On one hand, to make the model adapt to the evolving stream, a novel adaptive query strategy is designed to sample informative instances from two aspects: the changes in the dynamic data distribution and the uncertainty of anomalies. On the other hand, based on the similarity of instances in a neighborhood, we devise an augmented update method that generates pseudo labels for the unlabeled neighbors of query instances, which makes the enormous number of unlabeled instances usable during model updating. Extensive experiments on two benchmarks, CIC-IDS2017 and UNSW-NB15, demonstrate that A3PF achieves significant improvements over previous active methods in terms of the area under the receiver operating characteristic curve (AUC-ROC) (20.9% and 21.5%) and the area under the precision-recall curve (AUC-PR) (44.6% and 64.1%).
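The idea of passing a queried label on to nearby unlabeled instances can be illustrated with a small sketch. It assumes Euclidean distance, a fixed neighbor count `k`, and a distance cutoff `max_dist`; none of these choices, nor the function name `pseudo_label_neighbors`, come from the abstract, and the actual A3PF update rule may differ.

```python
import math


def pseudo_label_neighbors(queried, unlabeled, k=2, max_dist=1.0):
    """Propagate each queried label to its nearest unlabeled neighbors.

    queried:   list of (feature_vector, label) pairs obtained from the oracle
    unlabeled: list of feature vectors without labels
    Returns {index into unlabeled: pseudo label}. If two queried instances
    claim the same neighbor, the closer one wins (an assumed tie-break rule).
    """
    pseudo = {}
    best_dist = {}
    for q_vec, q_label in queried:
        ranked = sorted((math.dist(q_vec, u), i) for i, u in enumerate(unlabeled))
        for dist, i in ranked[:k]:
            if dist <= max_dist and dist < best_dist.get(i, math.inf):
                best_dist[i] = dist
                pseudo[i] = q_label
    return pseudo


queried = [((0.0, 0.0), "normal"), ((5.0, 5.0), "attack")]
unlabeled = [(0.1, 0.0), (0.0, 0.3), (5.2, 5.0), (9.0, 9.0)]
print(pseudo_label_neighbors(queried, unlabeled))
# → {0: 'normal', 1: 'normal', 2: 'attack'}
```

The distance cutoff keeps far-away instances (index 3 above) unlabeled, which limits the noise that propagated labels add to the model update.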
Funding: This work is supported in part by the Guangdong Science and Technology grant (No. 2016A010101033) and the Hong Kong and Macao joint research and development grant with Wuyi University (No. 2019WGAH21).
Abstract: Medical named entity recognition (NER) is the task of recognizing medical named entities, such as diseases, drugs, surgery reports, anatomical parts, and examination documents, in medical texts. Conventional medical NER methods do not make full use of the unlabelled medical texts embedded in medical documents. To address this issue, we propose a medical NER approach based on pre-trained language models and a domain dictionary. First, we construct a medical entity dictionary by extracting medical entities from labelled medical texts and collecting medical entities from other resources, such as the YiduN4K data set. Second, we use this dictionary to train domain-specific pre-trained language models on unlabelled medical texts. Third, we apply a pseudo-labelling mechanism to unlabelled medical texts to annotate them automatically and create pseudo labels. Fourth, the BiLSTM-CRF sequence tagging model is used to fine-tune the pre-trained language models. Our experiments on the unlabelled medical texts, which were extracted from Chinese electronic medical records, show that the proposed NER approach achieves strict and relaxed F1 scores of 88.7% and 95.3%, respectively.
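One common way to create pseudo labels for unlabelled text with an entity dictionary is longest-match lookup that emits BIO tags. The sketch below shows that generic technique only; the abstract does not state that its pseudo-labelling mechanism works this way, and the function name, the toy English dictionary, and the `max_len` parameter are all illustrative assumptions.

```python
def dictionary_bio_tags(tokens, entity_dict, max_len=4):
    """Tag tokens with BIO pseudo labels by greedy longest dictionary match.

    tokens:      list of word tokens
    entity_dict: maps a lower-cased, space-joined phrase to an entity type
    max_len:     longest phrase (in tokens) to try at each position
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in entity_dict:
                entity_type = entity_dict[phrase]
                tags[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{entity_type}"
                matched = n
                break
        i += matched if matched else 1
    return tags


# Hypothetical toy dictionary, not taken from the paper.
toy_dict = {"type 2 diabetes": "DISEASE", "diabetes": "DISEASE", "metformin": "DRUG"}
tokens = "patient has type 2 diabetes and takes metformin".split()
print(list(zip(tokens, dictionary_bio_tags(tokens, toy_dict))))
```

For Chinese clinical text the same loop would run over characters rather than whitespace-separated words, with phrases joined without spaces.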