Outlier detection has important applied value in the data mining literature. Different outlier detection algorithms, built on distinct theories, have different definitions and mining processes. Based on an analysis of the existing outlier detection algorithms in terms of their criteria and underlying theory, a three-dimensional space graph for constructing applied algorithms and an improved GridOf algorithm are proposed. Key words: outlier detection; three-dimensional space graph; data mining. CLC number: TP 311.13; TP 391. Foundation item: Supported by the National Natural Science Foundation of China (70371015). Biography: ZHANG Jing (1975-), female, Ph.D., lecturer; research direction: data mining and knowledge discovery.
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because outliers distort the statistics of geotechnical parameters and data sparsity introduces statistical uncertainty. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and determines outliers as those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique and rationally accounts for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also provides a method to determine the outlying components of each outlier. The approach is illustrated and verified using simulated and real-life datasets. The results show that it properly identifies outliers among sparse multivariate data, together with their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers because of the distortion of statistics by the outliers and the statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
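A minimal sketch of the distance measure named above, not the paper's Bayesian resampling procedure: points are scored by Mahalanobis distance from the sample mean, and a chi-square cutoff (a simplified stand-in for the 0.5 outlying-probability rule) flags outliers.

```python
# Hypothetical illustration: Mahalanobis-distance outlier scoring.
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared Mahalanobis distances
    return d2, d2 > chi2.ppf(alpha, df=X.shape[1])       # flag points beyond the cutoff

X = np.random.default_rng(0).normal(size=(50, 3))
X[0] = [8.0, -7.0, 9.0]                                  # inject an obvious outlier
d2, flags = mahalanobis_outliers(X)
print(bool(flags[0]), int(flags[1:].sum()))
```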
The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, traditional outlier detection methods do not consider the frequency of subset occurrence, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but the importance of each pattern is not taken into account in the outlier detection process, so the detected outliers may not truly reflect the actual situation. To address these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on a weighted data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and two deviation factors are then defined in the outlier detection phase to measure the abnormal degree of each transaction on the weighted data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and that MWRPM performs well in weighted rare pattern mining.
With the development of science and technology, the status of the water environment has received more and more attention. In this paper, we propose a deep learning model, named a Joint Auto-Encoder network, to solve the problem of outlier detection in water supply data. The Joint Auto-Encoder network first expands the size of the training data and extracts useful features from the input data, and then reconstructs the input data into an output. Outliers are detected based on the network's reconstruction errors, with a larger reconstruction error indicating a higher likelihood of being an outlier. Water supply data contain mainly two types of outliers: outliers with large values and outliers with values close to zero. We set two separate thresholds for the reconstruction errors to detect the two types of outliers respectively; data samples with reconstruction errors exceeding the thresholds are voted to be outliers. The two thresholds can be calculated from the classification confusion matrix and the receiver operating characteristic (ROC) curve. We also compare the Joint Auto-Encoder with the vanilla Auto-Encoder on both a synthetic data set and the MNIST data set. Our model outperforms the vanilla Auto-Encoder and several other outlier detection approaches, with a recall rate of 98.94 percent on water supply data.
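A minimal sketch of the reconstruction-error principle described above, not the paper's Joint Auto-Encoder: a plain PyTorch autoencoder is trained on the data, and samples whose reconstruction error exceeds a single percentile cutoff (a stand-in for the two-threshold voting scheme) are flagged.

```python
# Hypothetical illustration: autoencoder reconstruction error as an outlier score.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim, hidden=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.dec(self.enc(x))

torch.manual_seed(0)
x = torch.randn(256, 16)                      # stand-in for water-supply records
model = AutoEncoder(16)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):                          # train to reconstruct the inputs
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    opt.step()
with torch.no_grad():
    err = ((model(x) - x) ** 2).mean(dim=1)   # per-sample reconstruction error
threshold = err.quantile(0.95)                # hypothetical single cutoff
print(int((err > threshold).sum()), "samples flagged as outliers")
```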
Data reconciliation technology can decrease the level of corruption of process data due to measurement noise, but the presence of outliers caused by process peaks or unmeasured disturbances will smear the reconciled results. Based on an analysis of the limitations of conventional outlier detection algorithms, a modified outlier detection method for dynamic data reconciliation (DDR) is proposed in this paper. In the modified method, the outliers of each variable are distinguished individually and the corresponding weights are modified accordingly. The modified method can therefore use more of the information in the normal data and can efficiently decrease the effect of outliers. Simulation of a continuous stirred tank reactor (CSTR) process verifies the effectiveness of the proposed algorithm.
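A minimal sketch in the spirit of the idea above, not the paper's DDR formulation: a redundant measurement set is reconciled by weighted averaging, and any measurement whose residual exceeds 3 sigma has its weight sharply reduced so it barely influences the reconciled value. The measurement values and noise levels are hypothetical.

```python
# Hypothetical illustration: outlier-aware reweighting in a simple reconciliation.
import numpy as np

measurements = np.array([10.1, 9.9, 10.0, 14.5])    # last value is an outlier
sigma = np.array([0.2, 0.2, 0.2, 0.2])               # assumed measurement noise
estimate = np.median(measurements)                    # robust starting point
for _ in range(3):                                    # simple iterative reweighting
    residuals = np.abs(measurements - estimate)
    weights = np.where(residuals > 3 * sigma, 1e-3, 1.0 / sigma ** 2)
    estimate = np.average(measurements, weights=weights)
print(round(float(estimate), 3))                      # close to 10.0; outlier suppressed
```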
Outlier detection is a key research area in data mining, as it can identify data that are inconsistent within a data set. Outlier detection aims to find abnormal data within a large data set and has been applied in many fields, including fraud detection, network intrusion detection, disaster prediction, medical diagnosis, public security, and image processing. While outlier detection has been widely applied in real systems, its effectiveness is challenged by high dimensionality and redundant data attributes, which lead to detection errors and complicated calculations. The prevalence of mixed data is a further issue for outlier detection algorithms. An outlier detection method for mixed data based on neighborhood combinatorial entropy is studied to improve outlier detection performance by reducing the data dimension with an attribute reduction algorithm. The significance of attributes is determined, and the less influential attributes are removed based on neighborhood combinatorial entropy. Outlier detection is then conducted using the local outlier factor algorithm. The proposed method can be applied effectively to numerical and mixed multidimensional data. In the experimental part of this paper, we compare outlier detection before and after attribute reduction and show that removing the less influential attributes enhances outlier detection accuracy on numerical and mixed multidimensional data.
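A minimal sketch of the final scoring step named above, using scikit-learn's LocalOutlierFactor; the neighborhood-combinatorial-entropy attribute reduction itself is not reproduced, and the retained columns are a hypothetical stand-in for the reduced attribute set.

```python
# Hypothetical illustration: LOF applied after attribute reduction.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:3] += 6.0                                  # a few injected outliers
X_reduced = X[:, :2]                          # stand-in for the reduced attribute set
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_reduced)           # -1 marks detected outliers
print(int((labels == -1).sum()), "outliers flagged")
```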
The flue temperature is one of the important indicators characterizing the combustion state of an ethylene cracking furnace, and outliers in the temperature data can lead to false alarms. Conventional outlier detection algorithms, such as the Isolation Forest algorithm and the 3-sigma principle, cannot detect these outliers accurately. In order to improve detection accuracy and reduce computational complexity, an outlier detection algorithm for flue temperature data based on the CLOF (Clipping Local Outlier Factor) algorithm is proposed. The algorithm preprocesses the normalized data using a cluster pruning algorithm and achieves accurate and efficient outlier detection on the resulting candidate outlier set. Using the flue temperature data of an ethylene cracking furnace in a petrochemical plant, the main parameters of the CLOF algorithm are selected according to the experimental results, and the outlier detection effects of the Isolation Forest algorithm, the 3-sigma principle, the conventional LOF algorithm, and the CLOF algorithm are compared and analyzed. The results show that an appropriate clipping coefficient in the CLOF algorithm can significantly improve detection efficiency and detection accuracy. Compared with the outlier detection results of the Isolation Forest algorithm and the 3-sigma principle, the accuracy of the CLOF results is higher, and the amount of computation is significantly reduced.
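A minimal sketch of the two baselines mentioned above, the 3-sigma principle and the Isolation Forest, applied to a synthetic univariate temperature series; the CLOF cluster pruning step itself is not reproduced here.

```python
# Hypothetical illustration: two baseline outlier detectors on temperature data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
temps = rng.normal(loc=950.0, scale=5.0, size=500)    # synthetic flue temperatures
temps[::97] += 60.0                                    # injected spikes
# 3-sigma principle
mu, sigma = temps.mean(), temps.std()
three_sigma_flags = np.abs(temps - mu) > 3 * sigma
# Isolation Forest
iso = IsolationForest(contamination=0.01, random_state=0)
iso_flags = iso.fit_predict(temps.reshape(-1, 1)) == -1
print(int(three_sigma_flags.sum()), int(iso_flags.sum()))
```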
Purpose – Among the growing number of data mining (DM) techniques, outlier detection has gained importance in many applications and has also attracted much attention in recent times. In the past, research papers on outlier detection in safety care have treated the task as searching for needles in a haystack. However, outliers are not always erroneous. Therefore, the purpose of this paper is to investigate the role of outliers in healthcare services in general and in patient safety care in particular. Design/methodology/approach – A combined DM (clustering and nearest neighbor) technique is used for outlier detection, which provides a clear understanding and meaningful insights for visualizing data behaviors for healthcare safety. The outcomes, or the implicit knowledge, are vitally essential to a proper clinical decision-making process. The method is important to the semantics, and the novel treatment of patients' events and situations proves to play a significant role in the process of patient care safety and medication. Findings – The paper discusses a novel and integrated methodology that can be applied to the analysis of different biological data. Integrated DM techniques are discussed to optimize performance in the field of health and medical science. The integrated outlier detection method can be extended to search for valuable information and implicit knowledge based on selected patient factors. On this basis, outliers are detected as clusters and point events, and novel ideas are proposed to empower clinical services with customer satisfaction in mind. The work can also serve as a baseline for further healthcare strategic development and research. Research limitations/implications – This paper focuses mainly on outlier detection. Outlier isolation, which is essential for investigating why an outlier occurred and communicating how to mitigate it, is not addressed. The research can therefore be extended toward the hierarchy of patient problems. Originality/value – DM is a dynamic and successful gateway for discovering useful knowledge to enhance healthcare performance and patient safety. Clinical-data-based outlier detection is a basic task in achieving a healthcare strategy. Therefore, in this paper, the authors focus on combined DM techniques for a deep analysis of clinical data, which supports an optimal level of clinical decision-making. Proper clinical decisions can be obtained through attribute selection, which is important for identifying the influential factors or parameters of healthcare services. Using integrated clustering and nearest-neighbor techniques therefore gives a more acceptable search for such complex data outliers, which could be fundamental to further analysis of healthcare and patient safety situations.
Outlier detection is an important task in data mining. In practice, it is difficult to find the clustering centers in sophisticated multidimensional datasets and to measure the deviation degree of each potential outlier. In this work, an effective outlier detection method based on multi-dimensional clustering and local density (ODBMCLD) is proposed. ODBMCLD first identifies the center objects by the local density peaks of the data objects and clusters the whole dataset based on these center objects. Then, outlier objects belonging to different clusters are marked as candidates for abnormal data. Finally, the top N points among these abnormal candidates are chosen as the final anomaly objects with high outlier factors. The feasibility and effectiveness of the method are verified by experiments.
In this paper, we present a cluster-based algorithm for time series outlier mining. We use the discrete Fourier transform (DFT) to transform time series from the time domain to the frequency domain, so that each time series can be mapped to a point in k-dimensional space. A cluster-based algorithm is then developed to mine outliers from these points. The algorithm first partitions the input points into disjoint clusters and then prunes the clusters that are judged not to contain outliers. Our algorithm has been run on the electrical load time series of a steel enterprise and proved to be effective.
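A minimal sketch of the mapping step described above: each series is reduced to the magnitudes of its first k DFT coefficients, giving a point in k-dimensional space on which a clustering-based detector can then operate; the series are synthetic stand-ins for electrical load data.

```python
# Hypothetical illustration: DFT features for time series outlier mining.
import numpy as np

def dft_features(series, k=4):
    coeffs = np.fft.rfft(series)
    return np.abs(coeffs[:k])                 # magnitudes of the first k coefficients

rng = np.random.default_rng(0)
t = np.arange(96)
loads = [np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.1, t.size) for _ in range(20)]
loads.append(rng.normal(0, 1.0, t.size))      # one anomalous series
points = np.array([dft_features(s) for s in loads])
print(points.shape)                           # (21, 4): ready for clustering
```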
Focusing on controlling the press-assembly quality of high-precision servo mechanisms, an intelligent early warning method based on outlier detection and linear regression is proposed. Linear regression is used to model the relationship between assembly quality and the press-assembly process; the mathematical model of displacement versus force in the press-assembly process is established, and a qualified press-assembly force range is defined for assembly quality control. To preprocess the raw displacement-force dataset from the press-assembly process, an improved local outlier factor based on area density and P weight (LAOPW) is designed to eliminate outliers that would otherwise make the mathematical model inaccurate. A weighted distance based on information entropy is used to measure distance, and the reachability distance is replaced with the P weight. Experiments show that the detection efficiency of the algorithm is improved by 5.6 ms compared with the traditional local outlier factor (LOF) algorithm, and the detection accuracy is improved by about 2% compared with the local outlier factor based on area density (LAOF) algorithm. Application of the LAOPW algorithm together with the linear regression model shows that it can effectively provide intelligent early warning of the press-assembly quality of high-precision servo mechanisms.
Time series analysis, applied by establishing appropriate mathematical models for bridge health monitoring data and forecasting future structural behavior, is a novel and viable research direction for bridge state assessment. However, outliers inevitably exist in the monitoring data due to various interventions; they reduce the precision of model fitting and affect the forecasting results. The identification of outliers is therefore crucial for the accurate interpretation of the monitoring data. In this study, a time series model combined with outlier information for bridge health monitoring is established using intervention analysis theory, and forecasting of the structural responses is carried out. We focus on three techniques: (1) modeling with the seasonal autoregressive integrated moving average (SARIMA) model; (2) the methodology for outlier identification and amendment when the occurrence time and type of the outliers are known and when they are unknown; and (3) forecasting with the model including outlier effects. The method was tested in a case study using monitoring data from a real bridge. The establishment of the original SARIMA model without considering outliers is first discussed, including the stationarity, order determination, parameter estimation, and diagnostic checking of the model. Then the time-by-time iterative procedure for outlier detection, implemented with appropriate test statistics of the residuals, is performed, and the SARIMA-outlier model is built. Finally, a comparative analysis of the forecasting performance of the original model and the SARIMA-outlier model is carried out. The results demonstrate that proper time series models are effective in mining the characteristic patterns of bridge monitoring data. When the influence of outliers is taken into account, the fitting precision of the model is significantly improved and the accuracy and reliability of the forecast are strengthened.
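A minimal sketch of the residual-based detection step described above, not the paper's full intervention-analysis procedure: a SARIMA model is fitted with statsmodels, and observations whose standardized residuals exceed a fixed test-statistic cutoff are flagged; the series, model orders, and cutoff are hypothetical.

```python
# Hypothetical illustration: flagging outliers from SARIMA residuals.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(240)
y = 10 + np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, t.size)
y[100] += 4.0                                  # injected additive outlier
model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 0, 1, 12))
res = model.fit(disp=False)
std_resid = res.resid / res.resid.std()        # standardized residuals
outliers = np.where(np.abs(std_resid) > 3.0)[0]
print(outliers)
```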
Blast furnace data processing is prone to problems such as outliers. To overcome these problems and identify an improved method for processing blast furnace data, we conducted an in-depth study of such data. Based on data samples from selected iron and steel companies, the data were classified into types according to their characteristics; appropriate methods were then selected to address the deficiencies and outliers in the original blast furnace data. Linear interpolation was used to fill in the divided continuation data, the K-nearest neighbor (KNN) algorithm was used to fill in correlated data with an internal law, and periodic statistical data were filled with the average. The error rate of the filling was low, and the fitting degree was over 85%. For the screening of outliers, corresponding indicator parameters were added according to the continuity, relevance, and periodicity of the different data types, and a variety of algorithms were used for processing. Analysis of the screening results shows that a large amount of useful information in the data was retained and ineffective outliers were eliminated. Standardized processing of blast furnace big data, as the basis of applied research on such data, can serve as an important means of improving data quality and retaining data value.
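A minimal sketch of the two filling strategies named above, linear interpolation for continuation data and a KNN-based fill for correlated data, using pandas and scikit-learn; the column names and values are hypothetical.

```python
# Hypothetical illustration: filling gaps in blast furnace records.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "hot_blast_temp": [1180.0, np.nan, 1186.0, 1190.0, np.nan, 1195.0],
    "permeability":   [2.1, 2.3, np.nan, 2.6, 2.7, np.nan],
})
# Continuation data: fill by linear interpolation along the time axis.
df["hot_blast_temp"] = df["hot_blast_temp"].interpolate(method="linear")
# Correlated data: fill each gap from the K nearest rows over all columns.
df[:] = KNNImputer(n_neighbors=2).fit_transform(df)
print(df)
```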
In this study, we propose a low-cost system that can detect outliers in residents' space utilization in an indoor environment. We focus on users' app usage to analyze unusual behavior, especially in indoor spaces; this choice reflects the fact that the frequency of smartphone use in personal spaces has recently increased. Our system automatically collects data from mobile app logs and Google app servers and generates a high-dimensional dataset from which outlier behaviors can be detected. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was applied for effective analysis of singular movements. To analyze high-level mobile phone usage, the t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed. These two algorithms can effectively detect outlier behaviors in terms of movement and app usage in indoor spaces. The experimental results showed that our system enables effective spatial behavioral analysis at low cost when applied to logs collected in actual living spaces. Moreover, the large volumes of data required for outlier detection can be easily acquired, and the system can automatically detect the unusual behavior of a user in an indoor space. In particular, this study aims to reflect the recent trend of increasing smartphone use in indoor spaces in the behavioral analysis.
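A minimal sketch of the two algorithms named above on synthetic data: t-SNE embeds hypothetical high-dimensional app-usage features into two dimensions, and DBSCAN's noise label (-1) is read as outlier behavior; the feature construction from app logs is not reproduced.

```python
# Hypothetical illustration: t-SNE embedding followed by DBSCAN noise labeling.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
usage = rng.normal(size=(300, 20))                     # stand-in app-usage features
usage[:5] += 8.0                                       # unusual behavior
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(usage)
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(embedded)
print(int((labels == -1).sum()), "points labeled as noise/outliers")
```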
As data services rapidly penetrate our daily life, the mobile network becomes more complicated and the amount of data transmitted keeps increasing. In this situation, traditional statistical methods for anomalous cell detection cannot adapt to the evolution of networks, and data mining has become the mainstream. In this paper, we propose a novel kernel density-based local outlier factor (KLOF) to assign a degree of being an outlier to each object. First, the notion of KLOF is introduced, which captures exactly the relative degree of isolation. Then, by analyzing its properties, including the tightness of its upper and lower bounds and its sensitivity to density perturbation, we find that KLOF is much greater than 1 for outliers. Finally, KLOF is applied to a real-world dataset to detect anomalous cells with abnormal key performance indicators (KPIs) to verify its reliability. The experiments show that KLOF can find outliers efficiently and can serve as a guideline for operators to perform faster and more efficient troubleshooting.
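A minimal sketch in the spirit of a density-based factor, not the paper's exact KLOF definition: each point's kernel density estimate is compared with the average density of its neighbors, so a ratio well above 1 indicates relative isolation.

```python
# Hypothetical illustration: a kernel-density-based isolation score.
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [7.0, 7.0]                                       # injected outlier
kde = KernelDensity(bandwidth=0.5).fit(X)
density = np.exp(kde.score_samples(X))                  # per-point density estimate
nbrs = NearestNeighbors(n_neighbors=11).fit(X)
_, idx = nbrs.kneighbors(X)
neighbor_density = density[idx[:, 1:]].mean(axis=1)     # exclude the point itself
score = neighbor_density / density                      # >> 1 for isolated points
print(round(float(score[0]), 2), round(float(score[1:].mean()), 2))
```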
Outlier detection techniques play a vital role in exploring the unusual data of extreme events, which have a considerable effect on the modeling and forecasting of functional data. Functional methods offer an effective way of identifying outliers graphically that might not be visible in the original data plot of a classical analysis. The main objective of this study is to detect extreme rainfall events using functional outlier detection methods based on depth and density functions. In order to identify unusual events in the rainfall variation over long time intervals, this work is based on the average monthly rainfall of the Taiz region from 1998 to 2019. The data were extracted from the Tropical Rainfall Measuring Mission, and the analysis was carried out in R. The approaches applied in this study involve rainbow plots, the functional highest density region box-plot, and the functional bag-plot. According to the results, the functional density box-plot method proved more effective in detecting outliers than the functional depth bag-plot method. In conclusion, the results of the study show that the rainfall over the Taiz region during the last two decades was influenced by the extreme events of the years 1999, 2004, 2005, and 2009.
This paper puts forward a new method of density-based anomalous data mining; the method is used to design the engine of a network intrusion detection system (NIDS), and a new NIDS is constructed based on this engine. The NIDS can find new, unknown intrusion behaviors, which are used to update the intrusion rule base; based on this rule base, intrusion detection can be carried out online using the BM pattern matching algorithm. Finally, all modules of the NIDS are described in a formal language.
Outlier detection is a very useful technique in many applications where data are generally uncertain and can be described using probability. While it has been studied intensively in the field of deterministic data, outlier detection is still novel in the emerging field of uncertain data. In this paper, we study the semantics of outlier detection on probabilistic data streams and present a new definition of distance-based outliers over a sliding window. We then show that the problem of detecting an outlier over a set of possible-world instances is equivalent to the problem of finding the k-th element in its neighborhood. Based on this observation, a dynamic programming algorithm (DPA) is proposed to reduce the detection cost from O(2^|R(e,d)|) to O(k·|R(e,d)|), where R(e,d) is the d-neighborhood of e. Furthermore, we propose a pruning-based approach (PBA) to effectively and efficiently filter non-outliers within a single window and to incrementally detect the most recent m elements. Finally, detailed analysis and thorough experimental results demonstrate the efficiency and scalability of our approach.
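A minimal sketch of the classical, deterministic distance-based definition that the probabilistic formulation above generalizes: a point is an outlier if fewer than k other points lie within distance d of it; the data and the values of d and k are hypothetical.

```python
# Hypothetical illustration: classical distance-based outlier definition.
import numpy as np

def distance_outliers(X, d=1.0, k=5):
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    neighbor_counts = (dist <= d).sum(axis=1) - 1        # exclude the point itself
    return neighbor_counts < k                           # outlier: fewer than k neighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [6.0, 6.0]                                        # injected outlier
flags = distance_outliers(X)
print(bool(flags[0]), int(flags.sum()))
```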
Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS, and similar devices for data collection. Causes of uncertainty include the limitations of measurements, the inclusion of noise, inconsistent supply voltage, and the delay or loss of data in transfer. In order to manage, query, or mine such data, the data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection for uncertain data objects. In this work, an uncertain object is modeled by the probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop and is very costly because of the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are also presented to further increase efficiency. An extensive empirical study on synthetic and real datasets demonstrates the accuracy, efficiency, and scalability of the proposed algorithms.
Air pollution is a major issue related to the national economy and people's livelihood. At present, research on air pollution mostly focuses on pollutant emissions in a specific industry or region as a whole and pays little attention to enterprise pollutant emissions at the micro level. Limited by the amount and time granularity of data from enterprises, enterprise pollutant emissions are still understudied. Driven by big data on the air pollution emissions of industrial enterprises monitored in Beijing-Tianjin-Hebei, this paper carries out data mining of enterprise pollution emissions, including association analysis between different features based on grey association, association mining between different data based on association rules, and outlier detection based on clustering. The results show that: (1) the industries mainly affecting NOx and SO2 in Beijing-Tianjin-Hebei are the electric power, heat production and supply industry and the metal smelting and processing industries; (2) the districts near Hengshui and Shijiazhuang city in Hebei province form strong association rules; and (3) the industrial enterprises in Beijing-Tianjin-Hebei are divided into six clusters, of which three are outliers with excessive emissions of total VOCs, PM, and NH3, respectively.