Funding: The authors would like to acknowledge the support of the Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) (SML2020SP007). The paper is also supported by the National Natural Science Foundation of China (Nos. 61772280 and 62072249).
Abstract: Outlier detection is a key research area in data mining, as it can identify data that are inconsistent within a data set. Outlier detection aims to find abnormal data within a large data set and has been applied in many fields, including fraud detection, network intrusion detection, disaster prediction, medical diagnosis, public security, and image processing. While outlier detection has been widely applied in real systems, its effectiveness is challenged by high dimensionality and redundant data attributes, which lead to detection errors and complicated calculations. The prevalence of mixed data is a further issue for outlier detection algorithms. An outlier detection method for mixed data based on neighborhood combinatorial entropy is studied to improve detection performance by reducing the data dimension with an attribute reduction algorithm. The significance of attributes is determined, and the less influential attributes are removed based on neighborhood combinatorial entropy. Outlier detection is then conducted using the local outlier factor algorithm. The proposed method can be applied effectively to numerical and mixed multidimensional data using neighborhood combinatorial entropy. In the experimental part of this paper, we compare outlier detection before and after attribute reduction. The comparative analysis shows that detection accuracy is enhanced by removing the less influential attributes in numerical and mixed multidimensional data.
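The following is a minimal sketch of the overall flow described above: rank attributes by a significance score, drop the less influential ones, and then run the local outlier factor. The variance-based significance score is only a placeholder for the paper's neighborhood combinatorial entropy, and scikit-learn's LocalOutlierFactor stands in for the LOF step; the data are synthetic.

```python
# Sketch: attribute reduction followed by local outlier factor detection.
# The significance score below is a placeholder, NOT the paper's neighborhood
# combinatorial entropy measure.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))          # 200 samples, 6 numerical attributes
X[:5] += 6.0                           # inject a few obvious outliers

# Placeholder attribute-significance scores (higher = more influential).
significance = np.var(X, axis=0)
keep = significance >= np.median(significance)   # remove less influential attributes
X_reduced = X[:, keep]

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X_reduced)    # -1 marks detected outliers
print("detected outliers:", np.where(labels == -1)[0])
```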
Funding: Supported by the Commission of Science, Technology and Industry for National Defense (No. C192005C001).
Abstract: The data envelopment analysis (DEA) model is widely used to evaluate the relative efficiency of producers. It is an objective decision method with multiple indexes. However, the two basic models most frequently used at present, the C2R model and the C2GS2 model, have limitations when used alone, often resulting in unsatisfactory evaluations. To solve this problem, a mixed DEA model is built and used to evaluate the validity of the business efficiency of listed companies. An explanation of how to use this mixed DEA model is offered, and its feasibility is verified.
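For readers unfamiliar with the C2R (CCR) building block, here is a minimal sketch of its multiplier form solved as a linear program: maximize the weighted output of the evaluated decision-making unit (DMU) with its weighted input normalized to one, subject to no DMU exceeding efficiency one. The toy inputs and outputs are illustrative only, and this does not reproduce the paper's mixed model.

```python
# Sketch of the C2R (CCR) multiplier model as a linear program with SciPy.
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, o):
    """Efficiency of DMU o given inputs X (n x m) and outputs Y (n x s)."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision variables: output weights u (s) followed by input weights v (m).
    c = np.concatenate([-Y[o], np.zeros(m)])                    # maximize u·y_o
    A_ub = np.hstack([Y, -X])                                   # u·y_j - v·x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.zeros(s), X[o]]).reshape(1, -1)   # v·x_o = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    return -res.fun                                             # score in (0, 1]

X = np.array([[2.0, 3.0], [4.0, 2.0], [3.0, 5.0]])              # toy inputs
Y = np.array([[1.0], [2.0], [1.5]])                             # toy outputs
print([round(ccr_efficiency(X, Y, o), 3) for o in range(3)])
```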
Abstract: Classification and association rule mining are used to make decisions based on relationships between attributes and help decision makers make correct decisions at the right time. Associative classification first generates class-based association rules and uses them to build a rule set that predicts the class label for unseen data. Large data sets may contain many null-transactions; a null-transaction is a transaction that does not contain any of the itemsets being examined. It is important to consider the null-invariance property when selecting appropriate interestingness measures for correlation analysis. Real-world data sets have mixed attributes, and analyzing a mixed-attribute data set is not easy. Hence, the proposed work uses the cosine measure to avoid the influence of null-transactions during rule generation. It employs a mixed-kernel probability density function (PDF) to handle continuous attributes during data analysis. It is able to handle both nominal and continuous attributes and generates a mixed-attribute rule set. To explore the search space efficiently, it applies Ant Colony Optimization (ACO). Public data sets are used to analyze the performance of the algorithm. The results illustrate that the support-confidence framework with a correlation measure generates a more accurate and simpler rule set and discovers more interesting rules.
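The cosine measure mentioned above is null-invariant: cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) · sup(B)), so transactions containing neither A nor B do not change the score. A minimal sketch with a toy transaction list (not the paper's ACO rule miner) is shown below.

```python
# Sketch of the null-invariant cosine measure for itemset correlation.
from math import sqrt

def cosine_measure(transactions, a, b):
    n_a = sum(a <= t for t in transactions)         # transactions containing A
    n_b = sum(b <= t for t in transactions)         # transactions containing B
    n_ab = sum((a | b) <= t for t in transactions)  # transactions containing both
    return n_ab / sqrt(n_a * n_b) if n_a and n_b else 0.0

transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"tea"}]
print(cosine_measure(transactions, {"milk"}, {"bread"}))   # about 0.816

# Appending null-transactions (containing neither item) leaves the score unchanged,
# unlike lift or confidence.
transactions += [{"tea"}] * 100
print(cosine_measure(transactions, {"milk"}, {"bread"}))   # still about 0.816
```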
Abstract: The clustering of objects (individuals or variables) is one of the most widely used approaches to exploring multivariate data. The two most common unsupervised clustering strategies are hierarchical ascending clustering (HAC) and k-means partitioning, used to identify groups of similar objects in a dataset and divide it into homogeneous groups. The proposed topological clustering of variables, called TCV, studies a homogeneous set of variables defined on the same set of individuals, based on the notion of neighborhood graphs; some of these variables are more or less correlated or linked depending on whether the variables are quantitative or qualitative. This topological data analysis approach can be useful for dimension reduction and variable selection. It is a topological hierarchical clustering analysis of a set of variables, which can be quantitative, qualitative, or a mixture of both. It arranges variables into homogeneous groups according to their correlations or associations, studied in a topological context of principal component analysis (PCA) or multiple correspondence analysis (MCA). The proposed TCV is adapted to the type of data considered; its principle is presented and illustrated using simple real datasets with quantitative, qualitative, and mixed variables. The results of these illustrative examples are compared with those of other variable clustering approaches.
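To make the idea of clustering variables (rather than individuals) concrete, here is a minimal sketch for the purely quantitative case: variables are clustered hierarchically with 1 − |correlation| as dissimilarity. This is not TCV's neighborhood-graph construction, and the data are synthetic.

```python
# Sketch: hierarchical clustering of variables using a correlation-based dissimilarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 2))
# Six variables: columns 0-2 follow the first latent factor, 3-5 the second.
data = np.column_stack([base[:, 0] + 0.1 * rng.normal(size=100) for _ in range(3)] +
                       [base[:, 1] + 0.1 * rng.normal(size=100) for _ in range(3)])

dissim = 1.0 - np.abs(np.corrcoef(data, rowvar=False))   # variable-by-variable
np.fill_diagonal(dissim, 0.0)
Z = linkage(squareform(dissim, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))             # e.g. [1 1 1 2 2 2]
```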
Abstract: This paper takes stock price synchronization and price delay as indicators of information efficiency, and uses mixed cross-sectional data on listed companies in which Qualified Foreign Institutional Investors (QFII) have invested to study the impact of QFII investment behavior on the information efficiency of China's stock market. The results show that QFII investments can improve the information efficiency of China's stock market, but the impact varies: it is more significant in bear markets than in bull markets, more significant for private enterprises than for state-owned enterprises, and more significant in the Small and Medium Enterprises (SME) market than in the main board market. Further research also finds that QFII has a certain threshold effect on the information efficiency of China's stock market. This paper provides a problem-solving perspective for China's capital markets to achieve information efficiency through opening up, while also warning against financial risks.
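One common way to operationalize stock price synchronization is the logistic transform of the R-squared from a market-model regression, ln(R² / (1 − R²)); whether the paper uses exactly this construction is an assumption here. A minimal sketch on synthetic returns:

```python
# Sketch of a standard synchronicity proxy (assumed construction, synthetic data).
import numpy as np

def synchronicity(stock_returns, market_returns):
    x = np.column_stack([np.ones_like(market_returns), market_returns])
    beta, *_ = np.linalg.lstsq(x, stock_returns, rcond=None)
    resid = stock_returns - x @ beta
    r2 = 1.0 - resid.var() / stock_returns.var()
    return np.log(r2 / (1.0 - r2))      # higher value = more synchronous pricing

rng = np.random.default_rng(2)
market = rng.normal(0, 0.02, 250)
stock = 0.8 * market + rng.normal(0, 0.01, 250)
print(round(synchronicity(stock, market), 3))
```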
Funding: NSF grant ATM-9812729 and NOAA grant NA77WA0571. Qiao is also supported by the Chinese National Key Basic Research Project under Contract G1999043809.
Abstract: The impact of diabatic processes on 4-dimensional variational data assimilation (4D-Var) was studied using the 1995 version of NCEP's global spectral model with and without full physics. The adjoint was coded manually. A cost function measuring spectral errors of 6-hour forecasts against 'observations' (the NCEP reanalysis data) was minimized using L-BFGS (the limited-memory quasi-Newton algorithm developed by Broyden, Fletcher, Goldfarb and Shanno) to optimize parameters and initial conditions. Minimization of the cost function constrained by an adiabatic version of the NCEP global model converged to a minimum with a significant decrease in the value of the cost function. Minimization of the cost function using the diabatic model, however, failed after a few iterations due to discontinuities introduced by physical parameterizations. Examination of the convergence of the cost function in different spectral domains reveals that the large-scale flow is adjusted during the first 10 iterations, in which discontinuous diabatic parameterizations play very little role. The adjustment produced by the minimization gradually moves to relatively smaller scales between the 10th and 20th iterations. During this transition period, discontinuities in the cost function produced by 'on-off' switches in the physical parameterizations caused the cost function to stay in a shallow local minimum instead of continuing to decrease toward a deeper minimum. Next, a mixed 4D-Var scheme was tested in which large-scale flows are first adiabatically adjusted to a sufficient level, followed by a diabatic adjustment introduced after 10 to 20 iterations. The mixed 4D-Var produced a closer fit of the analysis to observations, with 38% and 41% greater decreases in the value of the cost function and the norm of the gradient, respectively, than the standard diabatic 4D-Var, while the CPU time was reduced by 21%. The resulting optimal initial conditions improve the short-range forecast skill of 48-hour statistics. The detrimental effect of parameterization discontinuities on minimization was also reduced.
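As a toy illustration of the variational setup (not the NCEP adjoint system), the sketch below minimizes a quadratic 4D-Var-style cost J(x0) = ½‖M(x0) − y‖² with an L-BFGS optimizer, where a linear operator M stands in for the 6-hour forecast model and the gradient is computed with the adjoint (transpose) of M.

```python
# Sketch: minimizing a quadratic cost with L-BFGS using an adjoint-based gradient.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
M = rng.normal(size=(40, 40))            # stand-in linear "forecast model"
x_true = rng.normal(size=40)
y_obs = M @ x_true                       # synthetic "observations"

def cost_and_grad(x0):
    d = M @ x0 - y_obs                   # forecast-minus-observation misfit
    return 0.5 * d @ d, M.T @ d          # cost and its adjoint-based gradient

res = minimize(cost_and_grad, np.zeros(40), jac=True, method="L-BFGS-B",
               options={"maxiter": 200})
print("final cost:", round(res.fun, 6), "iterations:", res.nit)
```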
Abstract: The shock of the COVID-19 pandemic has caused structural changes in the economy, posing new challenges for inflation forecasting, while the arrival of the big-data era offers new opportunities to improve the timeliness of inflation forecasts. This paper therefore explores big-data-based inflation nowcasting and proposes a basic nowcasting framework whose core is the introduction of new big-data real-time macroeconomic variables or big-data forecasting methods. By introducing such a real-time variable, the internet-based consumer price index (iCPI) built from online big data, including daily and weekly chain indices and ten-day and monthly year-on-year indices for both the overall index and major categories, and by applying LASSO (the least absolute shrinkage and selection operator) for dimension reduction together with the mixed data sampling (MIDAS) model, the paper effectively improves the timeliness and accuracy of inflation forecasts. The findings show that iCPI at all frequencies helps improve forecasting accuracy, outperforming the benchmark model and most traditional indicators of the same frequency, and that combining it with traditional indicators further reduces forecast errors, so traditional variables and methods cannot yet be completely abandoned. At all frequencies except the daily one, the eight major iCPI categories forecast better than the overall iCPI. Big-data indicators of different frequencies each have advantages in forecasting accuracy and timeliness, which is related to the information structure they reflect; the high-frequency ten-day year-on-year iCPI stands out in particular, balancing timeliness and accuracy well. This study provides a useful reference for using big data to improve the accuracy and timeliness of inflation forecasting and for innovating macroeconomic monitoring and forecasting systems in the digital economy era.
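A minimal sketch of combining LASSO shrinkage with a mixed-frequency (unrestricted MIDAS-style) design is shown below: daily indicator lags within each month enter as separate regressors and LASSO selects among them. The data are synthetic; the actual iCPI series and the paper's exact MIDAS specification are not reproduced here.

```python
# Sketch: LASSO selection over daily lags in an unrestricted MIDAS-style regression.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n_months, days_per_month = 120, 30
daily = rng.normal(size=(n_months, days_per_month))       # daily indicator, by month
true_w = np.exp(-0.15 * np.arange(days_per_month))        # recent days matter most
monthly_cpi = daily @ true_w + rng.normal(0, 0.5, n_months)

model = LassoCV(cv=5).fit(daily, monthly_cpi)             # LASSO picks informative lags
print("non-zero daily lags:", int(np.sum(model.coef_ != 0)))
print("in-sample R^2:", round(model.score(daily, monthly_cpi), 3))
```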
Funding: Funding from the School of Accounting and Finance, Faculty of Business, Hong Kong Polytechnic University.
Abstract: We develop a recession forecasting framework using a less restrictive target variable and a more flexible and inclusive specification than those used in the literature. The target variable captures the occurrence of a recession within a given future period rather than at a specific future point in time (as widely used in the literature). The modeling specification combines an autoregressive Logit model capturing the autocorrelation of business cycles, a dynamic factor model encompassing many economic and financial variables, and a mixed data sampling regression incorporating common factors with mixed sampling frequencies. The model generates significantly more accurate forecasts for U.S. recessions, with smaller forecast errors and stronger early signals for the turning points of business cycles than those generated by existing models.
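The sketch below illustrates the autoregressive Logit idea with the "recession within the next h months" target on synthetic data; the dynamic-factor and MIDAS components of the full specification are omitted, and the variable names are illustrative.

```python
# Sketch: autoregressive Logit with an "occurrence within h periods" target.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
T, h = 300, 6
factor = rng.normal(size=T)                               # stand-in common factor
state = (np.convolve(factor, np.ones(3), "same") < -1.0).astype(int)  # toy recession state

# Target: does a recession occur anywhere in the next h months?
target = np.array([state[t + 1:t + 1 + h].max() for t in range(T - h)])
lagged_target = np.concatenate([[0], target[:-1]])        # autoregressive term
X = sm.add_constant(np.column_stack([lagged_target, factor[:T - h]]))

fit = sm.Logit(target, X).fit(disp=0)
print(fit.params)                                         # constant, AR term, factor loading
```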