Abstract: Changepoint detection is challenging when the data contain outliers. This paper proposes a multivariate changepoint detection method, RWPCA-RFPOP, which combines a robust WPCA projection direction with the robust RFPOP method. The method is doubly robust and is suited to detecting mean changepoints in multivariate normal data that contain outliers and have highly correlated variables. Simulation results show that the method recovers both the number and the location of changepoints reliably in the presence of outliers. Finally, the method is applied to an ACGH dataset.
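As a rough illustration of why a bounded loss helps when outliers are present, the sketch below locates a single mean changepoint in a univariate series by minimizing a capped squared loss around each segment's median. It is a brute-force stand-in, not the RWPCA-RFPOP method: it omits the WPCA projection step and the pruning that RFPOP relies on, and the function names, the cap `k`, and the minimum segment length are assumptions made for this example.

```python
from statistics import median


def capped_cost(segment, k):
    """Sum of squared deviations from the segment median, each capped at k**2."""
    centre = median(segment)
    return sum(min((x - centre) ** 2, k ** 2) for x in segment)


def best_single_changepoint(series, k=3.0, min_len=2):
    """Return the split index tau that minimizes the total capped cost.

    The series is split into series[:tau] and series[tau:]. Capping the loss
    limits how much any single outlier can contribute (illustrative choice).
    """
    best_tau, best_cost = None, float("inf")
    for tau in range(min_len, len(series) - min_len + 1):
        cost = capped_cost(series[:tau], k) + capped_cost(series[tau:], k)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau


# Toy series: the mean shifts from about 0 to about 5 at index 6,
# and index 3 holds an outlier (9.0) inside the first segment.
series = [0.1, -0.2, 0.0, 9.0, 0.2, -0.1, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1]
print(best_single_changepoint(series))
```

Because the outlier's contribution is capped, it adds roughly the same cost to every candidate split and does not pull the estimated changepoint toward itself.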
Funding: State Key Program of National Natural Science of China (No. 41230632); National Natural Science Foundation of China (No. 41301123, 41201169)
Abstract: To study how industrial location differs among industries, this article tests spatial agglomeration across industries and firm sizes at the city level. The research is based on a unique plant-level data set of Beijing and employs a distance-based approach that treats space as continuous. Unlike previous studies, we set two sets of references, for service and manufacturing industries respectively, to suit an intra-urban investigation. Comparing eight types of industries and different firm sizes, we find that: 1) producer services, high-tech industries and labor-intensive manufacturing industries are more likely to cluster, whereas personal services and capital-intensive industries tend to be randomly dispersed in Beijing; 2) spillovers from the co-location of firms are more important to knowledge-intensive industries, and have a more significant impact on their allocation, than is the case for business-oriented services in the intra-urban area; 3) the spatial agglomeration of service industries is driven by larger establishments, whereas the results for manufacturing industries are mixed.
Funding: Supported by the Program of National Technological Basis from the Ministry of Science and Technology of China (No. 2005DKA21101) and the National Natural Science Foundation of China (No. 30700572)
Abstract: Nei's improved genetic distance (DA) and gene flow (Nm) were measured using sixteen microsatellite markers. Dendrograms based on the DA genetic distance, built with the neighbor-joining (NJ) method, and the STRUCTURE program were used to analyze the genetic structure of, and relationships among, 10 Chinese indigenous chicken breeds. The NJ dendrograms of DA genetic distance divided the 10 breeds into two main clusters: one consisted of breeds of low body weight (CHA, TIB, XIA, GUS and BAI), and the other contained heavier breeds (LAN, DAG, YOU, XIS and LUY). Among the lighter breeds, TIB and CHA clustered together, as did XIA and GUS. Among the heavier breeds, XIS and LUY clustered together in one branch, whereas LAN, DAG and YOU formed independent branches. These results were consistent with the Nm estimates among the 10 breeds. The STRUCTURE program correctly inferred the presence of genetic structure without the origin of individuals being pre-defined, and the genetic clusters it inferred were essentially the same as those from the DA distance clustering method. An advantage of the STRUCTURE program was its ability to identify migrants and admixed individuals in the 10 chicken populations, which the DA distance clustering method could not do.
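For reference, Nei's DA distance between two populations is commonly written as DA = 1 - (1/r) * sum over loci j of sum over alleles i of sqrt(x_ij * y_ij), where x_ij and y_ij are the frequencies of allele i at locus j in the two populations and r is the number of loci. A minimal Python sketch of that formula follows; the data layout (one dict of allele frequencies per locus), the allele labels, and the function name are assumptions for illustration and are not taken from the study.

```python
import math


def nei_da(pop_x, pop_y):
    """Nei's DA distance between two populations.

    pop_x, pop_y: lists with one dict per locus, mapping allele label to
    frequency (hypothetical layout chosen for this example).
    """
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("both populations need the same, non-zero number of loci")
    total = 0.0
    for freq_x, freq_y in zip(pop_x, pop_y):
        alleles = set(freq_x) | set(freq_y)
        # Alleles missing from one population contribute sqrt(0) = 0.
        total += sum(
            math.sqrt(freq_x.get(a, 0.0) * freq_y.get(a, 0.0)) for a in alleles
        )
    return 1.0 - total / len(pop_x)


# Two loci, made-up allele frequencies.
breed_a = [{"A1": 0.5, "A2": 0.5}, {"B1": 1.0}]
breed_b = [{"A1": 0.5, "A2": 0.5}, {"B2": 1.0}]
print(nei_da(breed_a, breed_b))
```

Identical frequency profiles give a distance of 0, and populations sharing no alleles at any locus give 1; a matrix of such pairwise distances is what a neighbor-joining routine would take as input.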
Funding: National Natural Science Foundation of China (Grant No. 60433020, 60673099, 60673023); "985" project of Jilin University
Abstract: Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. This paper presents a promising method for detecting the domain structure of a protein from sequence information alone. The method analyzes multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and these are combined into a single predictor using a Support Vector Machine (SVM). More importantly, domain detection is first cast as an imbalanced data learning problem, and a novel undersampling method based on distance-based maximal entropy in the SVM feature space is proposed. The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems on general imbalanced datasets.
Funding: Humanities and Social Science Foundation of the Ministry of Education of China (Grant No. 20YJCZH121).
Abstract: The structure and level of urban transit fares can strongly affect passengers' travel behavior and route choices. The fare policies commonly used in present transit networks lead to unbalanced transit assignment and improper distribution of transit resources. To distribute transit passenger flow evenly and efficiently, this paper introduces a new distance-based fare pattern using Euclidean distance. A bi-level programming model is developed to determine the optimal distance-based fare pattern, with a path-based stochastic transit assignment (STA) problem with elastic demand at the lower level. The upper level addresses a principal-agent game between transport authorities and transit enterprises, which pursue maximization of social welfare and of financial interest, respectively. A genetic algorithm (GA) is implemented to solve the bi-level model, and a numerical example illustrates that the proposed nonlinear distance-based fare pattern delivers better financial performance and distribution effect than other fare structures.
Abstract: A new distance-based update strategy is presented for Location Dependent Continuous Query (LDCQ) under an error limit. The likelihood that a moving object crosses the query boundary depends on its distance from that boundary, so moving objects influence the query result to different degrees. We therefore set a different deviation limit for each moving object according to its distance. A large number of unnecessary updates are eliminated and the load on the system is reduced.
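The idea of distance-dependent deviation limits can be sketched as follows. The abstract does not give the actual rule, so the linear scaling, the clamping bounds, and all names here are assumptions for illustration only: an object far from the query boundary tolerates a larger deviation from its last reported position before it sends an update.

```python
import math


def deviation_limit(distance_to_boundary, factor=0.5, min_limit=1.0, max_limit=50.0):
    """Deviation limit that grows with distance from the query boundary.

    The linear factor and the clamping bounds are hypothetical parameters.
    """
    return max(min_limit, min(max_limit, factor * distance_to_boundary))


def needs_update(last_reported, current, distance_to_boundary):
    """True if the object has drifted beyond its distance-dependent limit."""
    drift = math.dist(last_reported, current)
    return drift > deviation_limit(distance_to_boundary)


# The same 5-unit drift triggers an update near the boundary but not far from it.
print(needs_update((0.0, 0.0), (3.0, 4.0), distance_to_boundary=4.0))
print(needs_update((0.0, 0.0), (3.0, 4.0), distance_to_boundary=100.0))
```

Objects near the boundary are the ones most likely to change the query result, so giving them tight limits keeps the answer within the error bound while distant objects stay silent.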
Abstract: A variety of factors affect air quality, making it a difficult issue. Air quality refers to how clean the air is in a given area. Conventional approaches struggle to identify aberrant values or outliers correctly because this kind of data fluctuates significantly under the influence of climate change and the environment. With accelerating industrial expansion and rising population density in Kolkata City, air pollution is rising continuously. This study involves two phases: imputation of missing values in the first, and detection of outliers using Statistical Process Control (SPC) and Functional Data Analysis (FDA) in the second. The efficacy of the proposed outlier identification methodology is assessed on working days and non-working days for the variables NO₂, SO₂, and O₃, recorded over a full year in Kolkata, India. The results show that the functional data approach outperforms traditional outlier detection methods. The outcomes show that functional data analysis vibrates more than the other two approaches after imputation, and that the suggested outlier detector is appropriate for the precise detection of outliers in highly variable data.
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 61973094), the Maoming Natural Science Foundation (Grant No. 2020S004), and the Guangdong Basic and Applied Basic Research Fund Project (Grant No. 2023A1515012341).
Abstract: The flue temperature is one of the important indicators of the combustion state of an ethylene cracking furnace, and outliers in the temperature data can lead to false alarms. Conventional outlier detection algorithms such as the Isolation Forest algorithm and the 3-sigma principle cannot detect the outliers accurately. To improve detection accuracy and reduce computational complexity, an outlier detection algorithm for flue temperature data based on the Clipping Local Outlier Factor (CLOF) algorithm is proposed. The algorithm preprocesses the normalized data using a cluster pruning algorithm and then performs accurate and efficient outlier detection within the resulting outlier candidate set. Using flue temperature data from an ethylene cracking furnace in a petrochemical plant, the main parameters of the CLOF algorithm are selected according to experimental results, and the outlier detection performance of the Isolation Forest algorithm, the 3-sigma principle, the conventional LOF algorithm and the CLOF algorithm is compared and analyzed. The results show that an appropriate clipping coefficient in the CLOF algorithm can significantly improve detection efficiency and detection accuracy. Compared with the Isolation Forest algorithm and the 3-sigma principle, the CLOF algorithm gives more accurate detection results with a significantly reduced amount of computation.
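To make the comparison between the 3-sigma principle and a density-based score concrete, the sketch below implements both for one-dimensional data using only the Python standard library. It shows the conventional LOF score, not CLOF: the cluster pruning (clipping) step described in the abstract is not reproduced, and the readings, the neighbourhood size `k`, and the threshold of 2.0 are made-up values for illustration.

```python
from statistics import mean, pstdev


def three_sigma_outliers(values):
    """Indices of points more than 3 standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]


def lof_scores(values, k):
    """Local Outlier Factor for 1-D data.

    Simplifications: values are assumed distinct, k < len(values), and ties
    beyond the k-th neighbour are dropped rather than included.
    """
    n = len(values)
    neighbours, k_distance = [], []
    for i in range(n):
        ranked = sorted((abs(values[i] - values[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], abs(values[i] - values[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [1.0 / mean([reach_dist(i, j) for j in neighbours[i]]) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density.
    return [mean([lrd[j] for j in neighbours[i]]) / lrd[i] for i in range(n)]


def lof_outliers(values, k=3, threshold=2.0):
    """Indices whose LOF score exceeds the (illustrative) threshold."""
    return [i for i, s in enumerate(lof_scores(values, k)) if s > threshold]


readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.05, 14.0]
print(three_sigma_outliers(readings))
print(lof_outliers(readings))
```

In this toy sample the single extreme reading inflates the standard deviation enough to hide itself from the 3-sigma rule, while its LOF score is far above that of the other points. A pruning step in the spirit of CLOF would discard points in dense clusters first so that the scores are only computed for a small candidate set.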
Abstract: Owing to advances in information technology, massive quantities of data are produced by social media, smartphones, and sensor devices. The investigation of data streams using machine learning (ML) approaches to address regression, prediction, and classification problems has received considerable interest. At the same time, the detection of anomalies or outliers and feature selection (FS) have become important. This study develops an outlier detection with feature selection technique for streaming data classification, named the ODFST-SDC technique. Initially, streaming data are pre-processed in two ways, namely categorical encoding and null value removal. Next, the Local Correlation Integral (LOCI) is used to detect and remove outliers. A red deer algorithm (RDA) based FS approach is then employed to derive an optimal subset of features. Finally, a kernel extreme learning machine (KELM) classifier is used for streaming data classification. The design of LOCI-based outlier detection and RDA-based FS constitutes the novelty of the work. To assess the classification outcomes of the ODFST-SDC technique, a series of simulations was performed using three benchmark datasets. The experimental results showed promising outcomes for the ODFST-SDC technique compared with recent approaches.
Funding: Supported by the Natural Science Foundation of China (62072388, 62276146), the Industry Guidance Project Foundation of the Science Technology Bureau of Fujian Province (2020H0047), the Natural Science Foundation of the Science Technology Bureau of Fujian Province (2019J01601), the Creation Fund Project of the Science Technology Bureau of Fujian Province (JAT190596), and the Putian University Research Project (2022034).
Abstract: Background: Image matching is crucial in numerous computer vision tasks such as 3D reconstruction and simultaneous visual localization and mapping. The accuracy of the matching significantly affects subsequent processing. When image pairs contain comparable patterns at different positions, the local similarity of the features can cause incorrect matches because global motion consistency is disregarded. Methods: This study proposes an image-matching filtering algorithm based on global motion consistency. Following the principle of motion smoothness, it can be used as a post-filter for the initial matching results generated by other matching algorithms. A particular matching algorithm is first used to perform the initial matching; then, the rotation and movement information of the global feature vectors is combined to identify outlier matches effectively. The principle is that if the matching result is accurate, the feature vectors formed by any matched point should have similar rotation angles and moving distances. Thus, global motion direction and global motion distance consistencies were used to reject outliers caused by similar patterns in different locations. Results: Four datasets were used to test the effectiveness of the proposed method. Three datasets with similar patterns in different locations were used to test the results for similar images that could easily be matched incorrectly by other algorithms, and one commonly used dataset was used to test the results for the general image-matching problem. The experimental results suggest that the proposed method is more accurate than other state-of-the-art algorithms in identifying mismatches in the initial matching set. Conclusions: The proposed outlier rejection matching method can significantly improve the matching accuracy for similar images with locally similar feature pairs in different locations and can provide more accurate matching results for subsequent computer vision tasks.
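The consistency principle described in the Methods can be illustrated with a small filter over putative matches. This is a simplified stand-in rather than the published algorithm: it takes the component-wise median displacement as the global motion, then keeps only matches whose displacement direction and length agree with it. The function name and both tolerances are assumptions chosen for the example.

```python
import math
from statistics import median


def filter_matches(matches, max_angle_deg=15.0, max_length_ratio=1.5):
    """Keep matches whose motion agrees with the global (median) motion.

    matches: list of ((x1, y1), (x2, y2)) point pairs from an initial matcher.
    The angle and length-ratio tolerances are illustrative, not tuned values.
    """
    dxs = [q[0] - p[0] for p, q in matches]
    dys = [q[1] - p[1] for p, q in matches]
    ref_dx, ref_dy = median(dxs), median(dys)
    ref_len = math.hypot(ref_dx, ref_dy)

    kept = []
    for (p, q), dx, dy in zip(matches, dxs, dys):
        length = math.hypot(dx, dy)
        # Unsigned angle between this displacement and the reference displacement.
        cross = ref_dx * dy - ref_dy * dx
        dot = ref_dx * dx + ref_dy * dy
        angle = abs(math.degrees(math.atan2(cross, dot)))
        # Ratio of the longer to the shorter displacement (guarded against zero).
        ratio = max(length, ref_len) / max(min(length, ref_len), 1e-9)
        if angle <= max_angle_deg and ratio <= max_length_ratio:
            kept.append((p, q))
    return kept


# Four matches share roughly the same motion; the last one points elsewhere.
matches = [
    ((0.0, 0.0), (10.0, 1.0)),
    ((5.0, 5.0), (15.0, 6.0)),
    ((2.0, 8.0), (12.0, 9.0)),
    ((7.0, 3.0), (17.0, 4.5)),
    ((4.0, 4.0), (-6.0, 20.0)),
]
print(len(filter_matches(matches)))
```

A single global displacement only suits image pairs related mainly by translation; handling rotation or scale changes between the images would need a richer motion model than this sketch uses.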
Abstract: Human life would be impossible without clean air. Continuous advances in practically every aspect of contemporary life have harmed air quality. Everyday industrial, transportation, and household activities release dangerous contaminants into our surroundings. This study applied outlier detection to two years' worth of air quality data from two Indian cities. Studies on air pollution have used many types of methodology, typically treating the gases as a vector whose components are the gas concentration values for each observation performed. In our technique, we use curves to represent the monthly average of daily gas emissions. The approach, which is based on functional depth, was used to find outliers in the gas emissions of Delhi and Kolkata, and the outcomes were compared with those from the traditional method. In the evaluation and comparison of these models' performance, the functional approach performed well.
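A depth-based search for outlying curves can be sketched as below. The abstract does not say which functional depth was used, so this example assumes the Fraiman-Muniz depth, one common choice: at each time point a curve is scored by how central its value is among all curves, and the scores are averaged over time. The toy curves and the function name are made up for illustration.

```python
def fraiman_muniz_depth(curves):
    """Fraiman-Muniz depth of each curve, using the empirical CDF per time point.

    curves: list of equal-length lists, one list of values per curve.
    Deeper (more central) curves score closer to 1; outlying curves score lower.
    """
    n = len(curves)
    n_points = len(curves[0])
    depths = []
    for curve in curves:
        total = 0.0
        for t in range(n_points):
            # Share of curves at or below this curve's value at time t.
            ecdf = sum(1 for other in curves if other[t] <= curve[t]) / n
            total += 1.0 - abs(0.5 - ecdf)
        depths.append(total / n_points)
    return depths


# Five short "monthly" curves; the last one sits far above the rest.
curves = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [0.9, 1.9, 2.9],
    [1.05, 2.05, 3.05],
    [5.0, 6.0, 7.0],
]
depths = fraiman_muniz_depth(curves)
print(depths.index(min(depths)))
```

In practice the least deep curves are flagged against a cutoff, often chosen by resampling, rather than by simply taking the minimum as this toy example does.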