With the soaring generation of hazardous waste (HW) during industrialization and urbanization, illegal HW dumping remains an intractable global issue. Particularly in developing regions with lax regulations, it has become a major source of soil and groundwater contamination. A dominant challenge in supervising illegal HW dumping is the invisibility of dumping sites, which makes such dumping difficult to detect and thereby causes long-term adverse impacts on the environment. A key challenge is how to use the limited historical supervision records to screen for potential dumping sites across a whole region. In this study, a novel machine learning model based on the positive-unlabeled (PU) learning algorithm was proposed to resolve this problem through an ensemble method that iteratively mines the features of the limited historical cases. Validation of the random forest-based PU model showed that the predicted top 30% of high-risk areas covered 68.1% of newly reported cases in the studied region, indicating the reliability of the model's predictions. This framework is also promising for other environmental management scenarios that must deal with numerous unknown samples based on limited prior experience.
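The validation figure above (the top 30% of high-risk areas covering 68.1% of newly reported cases) is a capture rate at a fixed ranking cutoff. The sketch below shows one plausible way to compute such a figure. The function name `capture_rate`, the area identifiers, and the risk scores are all hypothetical and are not taken from the study, whose exact evaluation procedure is not described in the abstract.

```python
def capture_rate(risk_scores, new_case_areas, top_fraction=0.30):
    """Share of newly reported cases that fall in the top-ranked areas.

    risk_scores:    dict mapping area id -> predicted risk (higher = riskier)
    new_case_areas: list of area ids where new cases were reported
    top_fraction:   fraction of areas treated as "high risk"
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    if not new_case_areas:
        raise ValueError("need at least one newly reported case")

    # Number of areas inside the high-risk cutoff (at least one).
    n_top = max(1, round(top_fraction * len(risk_scores)))

    # Rank areas from highest to lowest predicted risk.
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    high_risk = set(ranked[:n_top])

    hits = sum(1 for area in new_case_areas if area in high_risk)
    return hits / len(new_case_areas)


# Made-up scores for ten areas; three areas have newly reported cases.
scores = {"A": 0.91, "B": 0.15, "C": 0.78, "D": 0.40, "E": 0.05,
          "F": 0.66, "G": 0.22, "H": 0.31, "I": 0.58, "J": 0.09}
new_cases = ["A", "C", "D"]
rate = capture_rate(scores, new_cases)
print(f"capture rate at top 30%: {rate:.3f}")
```

In this toy data the top three areas are A, C, and F, so two of the three new cases are captured. A capture rate well above the cutoff fraction itself (here 30%) suggests the ranking is more informative than random screening.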
In machine learning, positive-unlabelled (PU) learning is a special case of semi-supervised learning. In positive-unlabelled learning, the training set contains some positive examples and a set of unlabelled examples drawn from both the positive and negative classes. Positive-unlabelled learning has gained attention in many domains, especially for time-series data, where labelled data are difficult to obtain. Examples from the negative class are especially difficult to acquire. Self-learning is a semi-supervised method capable of PU learning on time-series data. In the self-learning approach, observations are moved one at a time from the unlabelled data into the positive class until a stopping criterion is reached. The model is retrained after each addition using the existing labels. The main problem in self-learning is knowing when to stop. Multiple stopping criteria exist in the literature, but they tend to be inaccurate or challenging to apply. This publication proposes a novel stopping criterion, called Peak evaluation using perceptually important points, to address this problem for time-series data. Peak evaluation using perceptually important points is exceptional in that it has no tunable hyperparameters, which makes it easily applicable in an unsupervised setting. At the same time, it is flexible, as it makes no assumptions about the balance of the dataset between the positive and negative classes.
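The self-learning loop described above can be sketched as follows. This is a minimal illustration that assumes a 1-nearest-neighbour model with Euclidean distance, which the abstract does not specify. The stopping rule `stop_on_jump` is a simple placeholder with a tunable `factor`. It is not the proposed Peak evaluation using perceptually important points criterion, whose details are not given here. All names and data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stop_on_jump(history, factor=3.0):
    """Placeholder rule: stop when the newest candidate is much farther
    from the positive set than the previously added ones were on average."""
    if len(history) < 2:
        return False
    *previous, latest = history
    return latest > factor * (sum(previous) / len(previous))


def self_learning(positives, unlabelled, should_stop):
    """Move unlabelled series into the positive set one at a time."""
    labelled = list(positives)
    remaining = list(unlabelled)
    added_dists = []  # distance of each accepted series at acceptance time

    while remaining:
        # "Retraining" a 1-NN model amounts to recomputing distances
        # against the enlarged positive set.
        dists = [min(euclidean(u, p) for p in labelled) for u in remaining]
        best = min(range(len(remaining)), key=dists.__getitem__)

        # Ask the stopping criterion before accepting the candidate.
        if should_stop(added_dists + [dists[best]]):
            break
        added_dists.append(dists[best])
        labelled.append(remaining.pop(best))

    return labelled, remaining, added_dists


# Toy data: one known positive, three similar series, two dissimilar ones.
positives = [[0, 1, 0, 1]]
unlabelled = [[0, 1, 0, 1.1], [0.1, 1, 0, 1], [5, 5, 5, 5],
              [0, 0.9, 0.1, 1], [6, 5, 6, 5]]
labelled, remaining, added_dists = self_learning(
    positives, unlabelled, stop_on_jump)
print(len(labelled), remaining)
```

On this toy data the loop accepts the three series close to the known positive and stops when the nearest remaining candidate is far away. The need to choose `factor` illustrates the drawback the abstract attributes to existing stopping criteria.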
Funding: the National Natural Science Foundation of China (71761147002, 71921003, and 52270199); the Jiangsu R&D Special Fund for Carbon Peaking and Carbon Neutrality (BK20220014); the State Key Laboratory of Pollution Control and Resource Reuse (PCRRZZ-202109).