Outlier detection is an important task in data mining. It is difficult to find the clustering centers in sophisticated multidimensional datasets and to measure the deviation degree of each potential outlier. In this work, an effective outlier detection method based on multi-dimensional clustering and local density (ODBMCLD) is proposed. ODBMCLD first identifies center objects by the local density peaks of data objects and clusters the whole dataset around these centers. Then, outlying objects within the different clusters are marked as candidates for abnormal data. Finally, the top N points among these candidates are chosen as the final anomaly objects with high outlier factors. The feasibility and effectiveness of the method are verified by experiments.
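A minimal NumPy sketch of the density-peak idea behind this family of methods is given below: local density and distance-to-denser-neighbour pick the cluster centers, and the lowest-density, farthest points become the top-N outlier candidates. The function name, parameters, and scoring rule are illustrative assumptions, not the paper's exact ODBMCLD procedure.

```python
import numpy as np

def odbmcld_sketch(X, dc=1.0, n_centers=2, top_n=2):
    """Toy density-peak outlier detection: pick cluster centers from local-density peaks,
    assign points to the nearest center, and rank low-density, far-away points as outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    rho = (D < dc).sum(axis=1) - 1                               # local density (cutoff kernel)
    delta = np.full(len(X), D.max())                             # distance to nearest denser point
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = D[i, higher].min()
    centers = np.argsort(rho * delta)[-n_centers:]               # density peaks = cluster centers
    labels = np.argmin(D[:, centers], axis=1)                    # nearest-center assignment
    score = D[np.arange(len(X)), centers[labels]] / (rho + 1)    # far from center + low density
    return centers, labels, np.argsort(score)[-top_n:]

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)), rng.standard_normal((50, 2)) + 6,
               [[20.0, 20.0], [-8.0, 15.0]]])                    # two clusters + two outliers
centers, labels, outliers = odbmcld_sketch(X, dc=1.5)
print("flagged outlier indices:", outliers)
```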
Changepoint detection faces challenges when outlier data are present. This paper proposes a multivariate changepoint detection method, RWPCA-RFPOP, which combines the robust WPCA projection direction with the robust RFPOP method. The method is doubly robust, making it suitable for detecting mean changepoints in multivariate normal data with highly correlated variables that include outliers. Simulation results demonstrate that the method provides strong guarantees on both the number and location of changepoints in the presence of outliers. Finally, the method is applied successfully to an ACGH dataset.
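As a rough illustration of projecting multivariate data onto a single direction and then searching for a mean changepoint with an outlier-resistant cost, here is a simplified sketch. It uses an ordinary SVD direction and an L1 segment cost rather than the paper's robust WPCA and RFPOP, so it should be read as a stand-in, not the proposed method.

```python
import numpy as np

def robust_single_changepoint(X):
    """Project onto the leading principal direction, then find one mean changepoint by
    minimising the sum of absolute deviations from the two segment medians (L1 cost)."""
    Xc = X - np.median(X, axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    y = Xc @ Vt[0]                                   # 1-D projected series
    n = len(y)
    best_tau, best_cost = None, np.inf
    for tau in range(2, n - 2):
        cost = (np.abs(y[:tau] - np.median(y[:tau])).sum()
                + np.abs(y[tau:] - np.median(y[tau:])).sum())
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(3, 1, (40, 4))])
X[10] += 15                                          # an outlier well before the true change
print("estimated changepoint:", robust_single_changepoint(X))
```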
In order to effectively detect privacy that may be leaked through social networks and avoid unnecessary harm to users, this paper takes microblogs as the research object to study the detection of privacy disclosure in social networks. First, we perform fast privacy-leak detection on the text to be published based on the fastText model. When the text contains certain private information, we fully consider the aggregation effect of private information leaked through different channels and establish a convolutional neural network model based on multi-dimensional features (MF-CNN) to detect privacy disclosure comprehensively and accurately. The experimental results show that the proposed method achieves higher privacy-disclosure detection accuracy and meets real-time detection requirements.
Although quality assurance and quality control procedures are routinely applied in most air quality networks, outliers can still occur due to instrument malfunctions, the influence of harsh environments and the limitations of measuring methods. Such outliers pose challenges for data-powered applications such as data assimilation, statistical analysis of pollution characteristics and ensemble forecasting. Here, a fully automatic outlier detection method was developed based on the probability of residuals, which are the discrepancies between the observed and the estimated concentration values. The estimation can be conducted using filtering, or regressions when appropriate, to discriminate four types of outliers characterized by temporal and spatial inconsistency, instrument-induced low variances, periodic calibration exceptions, and PM10 concentrations lower than PM2.5, respectively. This probabilistic method was applied to detect all four types of outliers in hourly surface measurements of six pollutants (PM2.5, PM10, SO2, NO2, CO and O3) from 1436 stations of the China National Environmental Monitoring Network during 2014-16. Among the measurements, 0.65%-5.68% are marked as outliers, with PM10 and CO more prone to outliers. Our method successfully identifies a trend of decreasing outliers from 2014 to 2016, which corresponds to known improvements in the quality assurance and quality control procedures of the China National Environmental Monitoring Network. The outliers can have a significant impact on the annual mean concentrations of PM2.5, with differences exceeding 10 μg m^-3 at 66 sites.
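A toy sketch of the residual-probability idea: estimate each hourly value with a running median, model the residuals with a robustly fitted normal distribution, and flag values whose tail probability is negligible. The window length and probability threshold are illustrative assumptions; the paper's four-type classification is not reproduced.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.stats import norm

def residual_outliers(series, window=25, p_thresh=1e-4):
    """Flag hourly values whose residual from a running-median estimate is improbably
    large under a robustly fitted normal model of the residuals."""
    estimate = median_filter(series, size=window, mode="nearest")
    resid = series - estimate
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))        # robust sigma (MAD)
    prob = 2 * norm.sf(np.abs(resid - np.median(resid)), scale=scale)   # two-sided tail prob.
    return np.where(prob < p_thresh)[0]

rng = np.random.default_rng(1)
pm25 = 35 + 10 * np.sin(np.arange(500) / 24) + rng.normal(0, 3, 500)    # synthetic hourly PM2.5
pm25[[100, 300]] = [400, -5]                                            # a spike and a bad reading
print("outlier hours:", residual_outliers(pm25))
```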
With the development of the Global Positioning System (GPS), wireless technology and location-aware services, it has become possible to collect a large quantity of trajectory data. In the field of data mining for moving objects, anomaly detection is a hot topic. Building on work on anomalous trajectory detection for moving objects, this paper introduces the classical trajectory outlier detection (TRAOD) algorithm and then proposes a density-based trajectory outlier detection (DBTOD) algorithm, which compensates for the TRAOD algorithm's inability to detect anomalies when trajectories are local and dense. Results of applying the proposed algorithm to the Elk1993 and Deer1995 datasets are also presented, demonstrating the effectiveness of the algorithm.
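A much-simplified sketch of distance-plus-density reasoning over whole trajectories is shown below: each trajectory is scored by its mean Hausdorff distance to its nearest neighbouring trajectories, so locally isolated trajectories stand out. This is an assumption-laden stand-in for illustration only, not the segment-based TRAOD or DBTOD algorithms.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def trajectory_outlier_scores(trajs, k=3):
    """Score each trajectory by its mean symmetric Hausdorff distance to its k nearest
    neighbouring trajectories; locally isolated trajectories receive large scores."""
    n = len(trajs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = max(directed_hausdorff(trajs[i], trajs[j])[0],
                    directed_hausdorff(trajs[j], trajs[i])[0])
            D[i, j] = D[j, i] = d
    return np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)   # skip the zero self-distance

rng = np.random.default_rng(2)
base = np.column_stack([np.linspace(0, 10, 50), np.sin(np.linspace(0, 10, 50))])
trajs = [base + rng.normal(0, 0.1, base.shape) for _ in range(9)]
trajs.append(base + np.array([0.0, 5.0]))                # one trajectory far from the flock
print("most anomalous trajectory:", int(np.argmax(trajectory_outlier_scores(trajs))))
```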
With the growth of the data age, data quality has become a problem of wide concern. As a field of data mining, outlier detection is closely related to data quality. The isolation forest algorithm is one of the more prominent numerical-data outlier detection algorithms of recent years. In the process of constructing isolation trees, as more trees are generated, the differences between them gradually decrease and may vanish entirely, which wastes memory and reduces the efficiency of outlier detection. Moreover, some of the constructed isolation trees cannot detect outliers at all. In this paper, an improved iForest-based method, GA-iForest, is proposed. This method optimizes the isolation forest by selecting better isolation trees according to detection accuracy and the differences between trees, thereby eliminating duplicate, similar, and poorly performing isolation trees and improving the accuracy and stability of outlier detection. In the experiments, the environment is built on Ubuntu and the Spark platform, and outlier datasets provided by ODDS are used as test data. Performance is evaluated with indicators such as accuracy, recall, ROC curves, AUC and execution time. Experimental results show that the proposed method not only improves the accuracy and stability of outlier detection, but also reduces the number of isolation trees by 20%-40% compared with the original iForest method.
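The tree-selection idea can be illustrated with scikit-learn by training a pool of small isolation forests and keeping only those that rank a labelled validation slice best; the genetic-algorithm search of GA-iForest is replaced here by a simple top-k selection, which is an assumption made for brevity.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (480, 5)), rng.uniform(-6, 6, (20, 5))])
y = np.r_[np.zeros(480), np.ones(20)]                          # 1 = outlier (validation labels)

# Train a pool of small forests, keep only those that separate the validation outliers best.
pool = [IsolationForest(n_estimators=10, random_state=i).fit(X) for i in range(20)]
aucs = [roc_auc_score(y, -f.score_samples(X)) for f in pool]   # score_samples: higher = more normal
keep = [pool[i] for i in np.argsort(aucs)[-8:]]                # retain the 8 best forests

ensemble_score = -np.mean([f.score_samples(X) for f in keep], axis=0)
print("ensemble AUC with selected forests:", round(roc_auc_score(y, ensemble_score), 3))
```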
In this paper, we propose a Packet Cache-Forward (PCF) method based on improved Bayesian outlier detection to eliminate out-of-order packets caused by drastic transmission-path degradation during handover events in mobile satellite networks, thereby improving TCP performance. The proposed method uses an access-node satellite to cache all received packets for a short time when a handover occurs and then forward them in order. To calculate the cache time accurately, this paper establishes a Bayesian mixture model for detecting delay outliers across the entire handover scheme. To handle misjudged outliers, an updated classification threshold and a sliding window are suggested to correct category assignments and model parameters, so that the exact compensation delay can be identified quickly under varying network loads. Simulations show that, compared with the average-processing-delay detection method, the average accuracy rate increases by about 4.0% while the error rate is cut by about 5.5%. The method also behaves well on large datasets. Benefiting from these advantages, and compared with conventional independent handover and network-controlled synchronized handover in simulated LEO satellite networks, the proposed independent handover with PCF eliminates the packet reordering issue and achieves a better congestion window. The average delay decreases by more than 70% and TCP performance improves by more than 300%.
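A compact sketch of the mixture-model step: fit a two-component Gaussian mixture to observed handover delays and treat samples assigned to the high-mean component as delay outliers. The component count, the synthetic delays, and the decision rule are assumptions for illustration; the paper's threshold updating and sliding window are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
delays = np.r_[rng.normal(40, 5, 950), rng.normal(180, 30, 50)].reshape(-1, 1)   # ms

gmm = GaussianMixture(n_components=2, random_state=0).fit(delays)
outlier_comp = int(np.argmax(gmm.means_.ravel()))          # the high-delay component
outliers = np.where(gmm.predict(delays) == outlier_comp)[0]

print("mean delay of the normal component: %.1f ms" % gmm.means_.ravel().min())
print("number of flagged delay outliers:", len(outliers))
```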
The widespread adoption of the Internet of Things (IoT) has transformed various sectors globally, making them more intelligent and connected. However, this advancement comes with challenges related to the effectiveness of IoT devices. These devices, present in offices, homes, industries, and more, need constant monitoring to ensure their proper functionality. The success of smart systems relies on their seamless operation and ability to handle faults. Sensors, crucial components of these systems, gather data and contribute to their functionality. Therefore, sensor faults can compromise the system's reliability and undermine the trustworthiness of smart environments. To address these concerns, various techniques and algorithms can be employed to enhance the performance of IoT devices through effective fault detection. This paper conducted a thorough review of the existing literature and a detailed analysis. This analysis effectively links sensor errors with prominent fault detection techniques capable of addressing them. The study is innovative because it paves the way for future researchers to explore errors that have not yet been tackled by existing fault detection methods. Significantly, the paper also highlights essential factors for selecting and adopting fault detection techniques, as well as the characteristics of datasets and their corresponding recommended techniques. Additionally, the paper presents a methodical overview of fault detection techniques employed in smart devices, including the metrics used for evaluation. Furthermore, the paper examines the body of academic work related to sensor faults and fault detection techniques within the domain. This reflects the growing inclination and scholarly attention of researchers and academicians toward strategies for fault detection within the realm of the Internet of Things.
The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but the computational complexity is particularly high when processing multidimensional datasets. In addition, traditional outlier detection methods do not consider the frequency of subset occurrence, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern-mining-based outlier detection approaches solve this problem, but the importance of each pattern is not taken into account during outlier detection, so the detected outliers cannot truly reflect the actual situation. Aimed at these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on weighted data streams. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine minimal weighted rare patterns, and two deviation factors are then defined in the outlier detection phase to measure the abnormal degree of each transaction on the weighted data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent outlier detection performance and that the MWRPM approach performs well in weighted rare pattern mining.
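The flavour of rare-pattern-based scoring can be conveyed with a toy sketch: enumerate small itemsets, call those with low weighted support rare, and give each transaction a deviation score based on the rare patterns it contains. The threshold, weights, and scoring rule are invented for illustration and omit MWRPM's minimality and pruning machinery.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"},
                {"a", "b"}, {"d", "e"}, {"a", "c"}]             # {"d","e"} is the odd one out
weights = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 2.0, "e": 2.0}    # item weights (assumed given)

def weighted_support(pattern):
    count = sum(pattern <= t for t in transactions)
    mean_weight = sum(weights[i] for i in pattern) / len(pattern)
    return mean_weight * count / len(transactions)

# Enumerate 1- and 2-itemsets and keep the "rare" ones (low weighted support).
patterns = {frozenset(c) for t in transactions for r in (1, 2) for c in combinations(t, r)}
rare = {p for p in patterns if weighted_support(p) < 0.3}

# Toy deviation factor: total (mean item) weight of the rare patterns a transaction contains.
for idx, t in enumerate(transactions):
    dev = sum(sum(weights[i] for i in p) / len(p) for p in rare if p <= t)
    print(idx, sorted(t), "deviation:", round(dev, 2))
```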
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With the sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, due to the distortion of the statistics of geotechnical parameters caused by outliers and the associated statistical uncertainty resulting from data sparsity. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on Mahalanobis distance and determines outliers as those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from datasets containing outliers with a re-sampling technique and rationally accounts for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also provides a dedicated method to determine the outlying components of each outlier. The approach is illustrated and verified using simulated and real-life datasets. Results show that it properly identifies outliers among sparse multivariate data, together with their outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
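A minimal sketch of the outlying-probability idea, assuming a simple resampling vote in place of the paper's Bayesian treatment: each resample yields its own mean and covariance, and a point's outlying probability is the fraction of resamples in which its Mahalanobis distance exceeds a chi-square cutoff.

```python
import numpy as np
from scipy.stats import chi2

def outlying_probability(X, n_resamples=200, subset_frac=0.7, seed=0):
    """For each row of X, estimate the probability of being a Mahalanobis-distance outlier
    across many re-sampled estimates of the mean and covariance."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    cutoff = chi2.ppf(0.975, df=p)
    votes = np.zeros(n)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(subset_frac * n), replace=False)
        mu = X[idx].mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X[idx], rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)   # squared Mahalanobis distance
        votes += d2 > cutoff
    return votes / n_resamples

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0], np.eye(3), size=30)       # sparse multivariate data
X[[3, 17]] += [6, -6, 5]                                          # two contaminated instances
prob = outlying_probability(X)
print("outliers (P > 0.5):", np.where(prob > 0.5)[0])
```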
Outlier detection has important applied value in the data mining literature. Outlier detection algorithms based on different theories have different definitions and mining processes. By analyzing existing outlier detection algorithms in terms of criteria and theory, a three-dimensional space graph for constructing applied algorithms and an improved GridOf algorithm are proposed. Key words: outlier detection, three-dimensional space graph, data mining.
In this study, we propose a low-cost system that can detect outlier space utilization by residents in an indoor environment. We focus on users' app usage to analyze unusual behavior, especially in indoor spaces. This reflects the recent increase in the frequency of smartphone use in personal spaces. Our system facilitates autonomous data collection from mobile app logs and Google app servers and generates a high-dimensional dataset from which outlier behaviors can be detected. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was applied for effective singular movement analysis. To analyze high-level mobile phone usage, the t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed. These two clustering algorithms can effectively detect outlier behaviors in terms of movement and app usage in indoor spaces. The experimental results showed that our system enables effective spatial behavioral analysis at low cost when applied to logs collected in actual living spaces. Moreover, the large volumes of data required for outlier detection can be easily acquired. The system can automatically detect the unusual behavior of a user in an indoor space. In particular, this study aims to incorporate the recent trend of increasing smartphone use in indoor spaces into behavioral analysis.
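DBSCAN's noise label provides the outlier signal in this kind of pipeline; a short sketch on synthetic daily-behavior features is given below (the feature names and parameters are assumptions; t-SNE embedding of higher-level app-usage vectors would precede this step in the described system).

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(6)
normal_days = rng.normal(0, 1, (96, 4))     # e.g. [room changes, screen-on hrs, app switches, visits]
unusual_days = rng.normal(6, 1, (4, 4))     # a few days of very different behaviour
features = np.vstack([normal_days, unusual_days])

# DBSCAN labels points in low-density regions as noise (-1); those noise points are the
# candidate behavioural outliers in this setting.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(features)
print("outlier days:", np.where(labels == -1)[0])
```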
With the development of science and technology, the status of the water environment has received more and more attention. In this paper, we propose a deep learning model, named the Joint Auto-Encoder network, to solve the problem of outlier detection in water supply data. The Joint Auto-Encoder network first expands the size of the training data and extracts useful features from the input data, and then reconstructs the input data effectively into an output. Outliers are detected based on the network's reconstruction errors, with a larger reconstruction error indicating a higher likelihood of being an outlier. Water supply data contain mainly two types of outliers: outliers with large values and those with values close to zero. We set two separate thresholds on the reconstruction errors to detect the two types of outliers respectively; data samples with reconstruction errors exceeding the thresholds are voted as outliers. The two thresholds can be calculated from the classification confusion matrix and the receiver operating characteristic (ROC) curve. We have also compared the Joint Auto-Encoder with the vanilla Auto-Encoder on both a synthetic dataset and the MNIST dataset. As a result, our model outperforms the vanilla Auto-Encoder and several other outlier detection approaches, achieving a recall of 98.94 percent on water supply data.
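To illustrate reconstruction-error-based detection without the Joint Auto-Encoder architecture, the sketch below trains a plain bottlenecked MLP as an autoencoder on sliding windows of a synthetic supply series and flags the worst-reconstructed windows; the window length, network sizes, and quantile threshold are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
flow = 100 + 10 * np.sin(np.arange(400) / 10) + rng.normal(0, 2, 400)    # hourly supply flow
flow[[50, 200]] = [400.0, 0.5]                     # a huge spike and a near-zero reading
X = np.lib.stride_tricks.sliding_window_view(flow, 24)                    # 24-hour windows
Xs = (X - X.mean()) / X.std()                      # simple global standardisation

# Plain autoencoder: a bottlenecked MLP trained to reproduce its own input; the per-window
# reconstruction error is the outlier score.
ae = MLPRegressor(hidden_layer_sizes=(12, 3, 12), max_iter=2000, random_state=0).fit(Xs, Xs)
errors = ((ae.predict(Xs) - Xs) ** 2).mean(axis=1)
threshold = np.quantile(errors, 0.95)              # the paper derives thresholds from ROC/confusion data
print("windows flagged as outliers:", np.where(errors > threshold)[0])
```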
On the basis of wavelet theory, we propose an outlier-detection algorithm for satellite gravity gradiometry by applying a wavelet-shrinkage-de-noising method to simulation data containing white noise and outliers. The results show that this novel algorithm has a 97% success rate in outlier identification and that it can be efficiently used for pre-processing real satellite gravity gradiometry data.
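A small PyWavelets sketch of the wavelet side of the idea, assuming a single-level decomposition: isolated outliers produce finest-scale detail coefficients far above the estimated noise level, so thresholding those coefficients points to the affected samples. This flags outliers directly from the coefficients rather than via the full shrinkage-and-compare pipeline of the paper.

```python
import numpy as np
import pywt

def wavelet_spike_detector(signal, wavelet="db4", k=5.0):
    """Flag isolated outliers from finest-scale wavelet detail coefficients: a spike
    produces detail coefficients far larger than the estimated noise level."""
    _, cD = pywt.dwt(signal, wavelet)                  # one-level decomposition
    sigma = np.median(np.abs(cD)) / 0.6745             # robust noise estimate from the details
    hits = np.where(np.abs(cD) > k * sigma)[0]
    return np.unique(np.clip(hits * 2, 0, len(signal) - 1))   # rough coefficient-to-sample map

rng = np.random.default_rng(8)
grad = np.sin(np.arange(1024) / 40) + rng.normal(0, 0.05, 1024)   # simulated gradiometry trace
grad[[100, 700]] += [1.5, -2.0]                                    # injected outliers
print("samples near detected outliers:", wavelet_spike_detector(grad))
```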
The heterogeneous nodes in the Internet of Things (IoT) are relatively weak in computing power and storage capacity, so traditional network security algorithms are not suitable for the IoT. Once these nodes alternate between normal and anomalous behavior, it is difficult for the network system to identify and isolate them in a short time, and the accuracy of transmitted data and the integrity of network functions are negatively affected. Based on the characteristics of the IoT, a lightweight local outlier factor detection method is used for node detection. To further determine whether nodes are anomalous, their behavior over time is considered in this research, and a time-series method is used so that the system can respond effectively, within a short period, to the randomness and selectivity of anomalously behaving nodes. Simulation results show that the proposed method can improve the accuracy of the data transmitted by the network and achieve better performance.
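The local-outlier-factor step can be sketched directly with scikit-learn's LocalOutlierFactor on per-node sensor readings; the feature choice and neighbourhood size below are illustrative assumptions, and the time-series stage of the paper is omitted.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(9)
readings = np.vstack([rng.normal([20.0, 55.0], [0.5, 2.0], (200, 2)),   # temperature, humidity
                      [[35.0, 10.0], [5.0, 95.0]]])                     # two misbehaving nodes

# LOF compares each node's local density with that of its neighbours; nodes that are much
# more isolated than their neighbourhood are labelled -1 (anomalous).
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(readings)
print("anomalous node readings:", np.where(labels == -1)[0])
```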
The detection of outliers and change points from time series has become a research focus in time series data mining, since it can be used for fraud detection, rare event discovery, event/trend change detection, and so on. In most previous works, outlier detection and change point detection were not related explicitly, and change point detection did not consider the influence of outliers. In this work, a unified detection framework is presented to deal with both. The framework is based on ALARCON-AQUINO and BARRIA's change point detection method and adopts two-stage detection to separate outliers from change points. Its advantages are twofold: first, the unified structure for change detection and outlier detection reduces the computational complexity and keeps the detection procedure simple; second, performing outlier detection before change point detection avoids the influence of outliers on the change point detection and thus improves its accuracy. Simulation experiments of the proposed method on both model data and actual application data achieve 100% detection accuracy. Comparisons between the traditional detection method and the proposed method further demonstrate that the unified detection structure is more accurate when the time series is contaminated by outliers.
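The two-stage idea can be illustrated with a simplified sketch: flag outliers against a running median, replace them, and only then locate the mean changepoint on the cleaned series. The specific filters, thresholds, and the single-changepoint scan are assumptions standing in for the framework described in the paper.

```python
import numpy as np
from scipy.ndimage import median_filter

def two_stage_detect(y, window=11, k=4.0):
    """Stage 1: flag outliers against a running median and replace them; stage 2: locate
    the mean changepoint on the cleaned series by minimising the two-segment squared error."""
    med = median_filter(y, size=window, mode="nearest")
    mad = 1.4826 * np.median(np.abs(y - med))
    outliers = np.where(np.abs(y - med) > k * mad)[0]
    clean = y.copy()
    clean[outliers] = med[outliers]                      # outliers cannot mislead stage 2
    costs = [clean[:t].var() * t + clean[t:].var() * (len(y) - t)
             for t in range(5, len(y) - 5)]
    return outliers, int(np.argmin(costs)) + 5

rng = np.random.default_rng(10)
y = np.r_[rng.normal(0, 1, 120), rng.normal(4, 1, 80)]
y[[30, 60]] = [15.0, -12.0]                              # outliers that could fool a naive detector
outliers, cp = two_stage_detect(y)
print("outliers:", outliers, "| changepoint near:", cp)
```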
Data reconciliation technology can decrease the level of corruption of process data due to measurement noise, but the presence of outliers caused by process peaks or unmeasured disturbances will smear the reconciled results. Based on an analysis of the limitations of conventional outlier detection algorithms, a modified outlier detection method for dynamic data reconciliation (DDR) is proposed in this paper. In the modified method, the outliers of each variable are distinguished individually and the corresponding weights are modified accordingly. Therefore, the modified method can use more of the information in normal data and can efficiently decrease the effect of outliers. Simulation of a continuous stirred tank reactor (CSTR) process verifies the effectiveness of the proposed algorithm.
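The sketch below illustrates per-variable outlier detection and weight modification in a toy reconciliation problem (a single flow balance rather than a CSTR): each variable's measurements are screened against their own running median, flagged points are down-weighted, and the weighted least-squares reconciliation then stops the gross error from smearing into the other variables. The constraint, noise levels, and weighting rule are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def reconcile(y, A, w):
    """One-step weighted least-squares reconciliation: min (y-x)' W (y-x) s.t. A x = 0."""
    Winv = np.diag(1.0 / w)
    lam = np.linalg.solve(A @ Winv @ A.T, A @ y)         # Lagrange multipliers
    return y - Winv @ A.T @ lam

A = np.array([[1.0, -1.0, -1.0]])                        # flow balance: x0 = x1 + x2
rng = np.random.default_rng(11)
T = 200
true_x = np.column_stack([np.full(T, 10.0), np.full(T, 6.0), np.full(T, 4.0)])
meas = true_x + rng.normal(0, 0.1, true_x.shape)
meas[80, 2] += 3.0                                       # a process peak / gross error on sensor 2

# Screen each variable against its own running median and down-weight flagged points, so the
# gross error is not smeared into the other reconciled variables.
med = median_filter(meas, size=(21, 1), mode="nearest")
mad = 1.4826 * np.median(np.abs(meas - med), axis=0)
weights = np.where(np.abs(meas - med) > 4 * mad, 0.01, 1.0)

recon = np.array([reconcile(meas[t], A, weights[t]) for t in range(T)])
print("reconciled flows at the outlier time step:", np.round(recon[80], 2))
```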
Node localization is commonly employed in wireless networks, for example to improve routing and enhance security. Localization algorithms can be classified as range-free or range-based. Range-based algorithms use location metrics such as ToA, TDoA, RSS, and AoA to estimate the distance between two nodes, while proximity sensing between nodes is typically the basis for range-free algorithms. A tradeoff exists, since range-based algorithms are more accurate but also more complex. However, in applications such as target tracking, localization accuracy is very important. In this paper, we propose a new range-based algorithm based on the density-based outlier detection algorithm (DBOD) from data mining. It requires selection of the K-nearest neighbours (KNN). DBOD assigns density values to each point used in the location estimation; the mean of these densities is calculated, and the points with density larger than the mean are kept as candidate points. Different performance measures are used to compare our approach with the linear least squares (LLS) and weighted linear least squares based on singular value decomposition (WLS-SVD) algorithms. It is shown that the proposed algorithm performs better than these algorithms even when the anchor geometry around an unlocalized node is poor.
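A toy sketch of the density-based selection step: candidate position estimates are produced from every three-anchor subset by linear least squares, each candidate's density is the number of nearby candidates, and only candidates with at least the mean density are averaged. The candidate construction, radius, and corrupted-range scenario are assumptions; the paper's KNN-based density definition is simplified to a radius count.

```python
import numpy as np
from itertools import combinations

def lls_position(anchors, ranges):
    """Linear least-squares position estimate from anchor positions and range measurements."""
    a0, r0 = anchors[0], ranges[0]
    A = 2 * (anchors[1:] - a0)
    b = r0**2 - ranges[1:]**2 + (anchors[1:]**2).sum(1) - (a0**2).sum()
    return np.linalg.lstsq(A, b, rcond=None)[0]

rng = np.random.default_rng(12)
anchors = np.array([[0, 0], [10, 0], [10, 10], [0, 10], [5, 12]], float)
target = np.array([3.0, 4.0])
ranges = np.linalg.norm(anchors - target, axis=1) + rng.normal(0, 0.2, 5)
ranges[1] += 5.0                                          # one badly corrupted range (e.g. NLOS)

# Candidate estimates from every 3-anchor subset, then density-based selection: keep only
# candidates whose local density (neighbours within a radius) reaches the mean density.
cands = np.array([lls_position(anchors[list(s)], ranges[list(s)])
                  for s in combinations(range(5), 3)])
D = np.linalg.norm(cands[:, None] - cands[None, :], axis=-1)
density = (D < 1.5).sum(axis=1) - 1
keep = cands[density >= density.mean()]
print("plain LLS estimate:  ", np.round(lls_position(anchors, ranges), 2))
print("DBOD-style estimate: ", np.round(keep.mean(axis=0), 2))
```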
We introduce a new wavelet-based procedure for detecting outliers in financial discrete time series. The procedure focuses on the analysis of residuals obtained from a model fit, and is applied to Generalized Autoregressive Conditional Heteroskedasticity (GARCH)-like models, but is not limited to them. We apply the Maximal-Overlap Discrete Wavelet Transform (MODWT) to the residuals and compare their wavelet coefficients against quantile thresholds to detect outliers. Our methodology has several advantages over existing methods that make use of the standard Discrete Wavelet Transform (DWT): the series sample size does not need to be a power of 2, and the transform can use any wavelet filter and be run up to the desired level. Simulated wavelet quantiles from a normal and a Student t-distribution are used as thresholds for the maximum of the absolute values of the wavelet coefficients. The performance of the procedure is illustrated on two real series: the closing prices of the Saudi stock market and the S&P 500 index. The efficiency of the proposed method is demonstrated, and it can be considered a distinct and important addition to existing methods.
We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample-size and traditional low-dimension high-sample-size datasets. Essentially, we avoid the computational bottleneck of techniques like the Minimum Covariance Determinant (MCD) by computing the needed determinants and associated measures in much lower-dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than the regularized methods in the high-dimensional low-sample-size setting, and that it often competes favorably with existing methods in terms of the percentage of correctly detected outliers.
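The core trick, computing robust covariance measures only in small random feature subspaces, can be sketched with scikit-learn's MinCovDet: distances from many low-dimensional subspaces are normalised and averaged into one outlier score. The subspace size, number of subspaces, and aggregation rule are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def random_subspace_outlier_scores(X, d=3, n_subspaces=30, seed=0):
    """Aggregate robust Mahalanobis distances computed in many random d-dimensional feature
    subspaces; MCD stays cheap because it only ever sees d columns at a time."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(n)
    for _ in range(n_subspaces):
        cols = rng.choice(p, size=d, replace=False)
        mcd = MinCovDet(random_state=0).fit(X[:, cols])
        dist = mcd.mahalanobis(X[:, cols])
        scores += dist / np.median(dist)                # normalise each subspace's contribution
    return scores / n_subspaces

rng = np.random.default_rng(13)
X = rng.normal(0, 1, (60, 40))                          # more features than full-dimensional MCD likes
X[:3, :10] += 5                                          # three outlying rows
scores = random_subspace_outlier_scores(X)
print("top-3 outlier rows:", np.argsort(scores)[-3:])
```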