Recent advances in deep learning have opened up new possibilities for fluid flow simulation in petroleum reservoirs. However, the predominant approach in existing research is to train neural networks on high-fidelity numerical simulation data. This presents a significant challenge because the sole source of authentic training data, wellbore production records, is sparse. In response to this challenge, this work introduces a novel architecture called the physics-informed neural network based on domain decomposition (PINN-DD), which aims to effectively utilize the sparse production data of wells for reservoir simulation of large-scale systems. To harness the capability of physics-informed neural networks (PINNs) on small spatial-temporal domains while addressing the challenges of large-scale systems with sparse labeled data, the computational domain is divided into two distinct sub-domains: the well-containing sub-domain and the well-free sub-domain. The two sub-domains and their interface are rigorously constrained by the governing equations, data matching, and boundary conditions. The accuracy of the proposed method is evaluated on two problems, and its performance is compared against state-of-the-art PINNs through numerical analysis as a benchmark. The results demonstrate the superiority of PINN-DD in handling large-scale reservoir simulation with limited data and show its potential to outperform conventional PINNs in such scenarios. Foundation item: funded by the National Natural Science Foundation of China (Grant No. 52274048), the Beijing Natural Science Foundation (Grant No. 3222037), the CNPC 14th Five-Year Perspective Fundamental Research Project (Grant No. 2021DJ2104), and the Science Foundation of China University of Petroleum-Beijing (No. 2462021YXZZ010).
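The coupling idea is easy to see in code. Below is a minimal sketch, not the authors' implementation: two small networks, one per sub-domain, trained jointly so that each satisfies a PDE residual on its own collocation points while a penalty forces their predictions to agree on the interface. The 1-D diffusion equation, the interface location, the network sizes, and the sampling are all illustrative assumptions; the paper's reservoir equations, well data-matching term, and boundary terms would enter as additional losses.

```python
# Minimal PINN-with-domain-decomposition sketch (assumptions throughout):
# stand-in PDE u_t = u_xx on x in [0, 1], interface assumed at x = 0.5.
import torch
import torch.nn as nn

def mlp():
    return nn.Sequential(nn.Linear(2, 32), nn.Tanh(),
                         nn.Linear(32, 32), nn.Tanh(),
                         nn.Linear(32, 1))

net_well, net_far = mlp(), mlp()   # well-containing / well-free sub-domains

def pde_residual(net, x, t):
    x.requires_grad_(True); t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - u_xx                          # residual of u_t = u_xx

opt = torch.optim.Adam(list(net_well.parameters()) + list(net_far.parameters()), lr=1e-3)
x_if = torch.full((64, 1), 0.5)                # interface location (assumed)
for step in range(2000):
    opt.zero_grad()
    # collocation points sampled in each sub-domain
    xw, tw = torch.rand(256, 1) * 0.5, torch.rand(256, 1)
    xf, tf = 0.5 + torch.rand(256, 1) * 0.5, torch.rand(256, 1)
    t_if = torch.rand(64, 1)
    loss_pde = pde_residual(net_well, xw, tw).pow(2).mean() \
             + pde_residual(net_far, xf, tf).pow(2).mean()
    # interface constraint: the two networks must agree on shared points
    u_w = net_well(torch.cat([x_if, t_if], 1))
    u_f = net_far(torch.cat([x_if, t_if], 1))
    loss_if = (u_w - u_f).pow(2).mean()
    # sparse well-production data would enter here as a mismatch term on net_well
    (loss_pde + loss_if).backward()
    opt.step()
```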
For a data cube there are always constraints between dimensions or among attributes within a dimension, such as functional dependencies. We consider the problem of how to exploit such functional dependencies, when they exist, to speed up the computation of sparse data cubes. A new algorithm, CFD (Computation by Functional Dependencies), is presented to meet this demand. CFD determines the order of dimensions by considering the cardinalities of dimensions and the functional dependencies between them together, thereby reducing the number of partitions for such dimensions. CFD also combines bottom-up partitioning with top-down aggregate computation to speed up the computation further. CFD can efficiently compute a data cube with hierarchies in a dimension, from the finest granularity to the coarsest. Key words: sparse data cube; functional dependency; dimension; partition; CFD. CLC number: TP 311. Foundation item: Supported by the E-Government Project of the Ministry of Science and Technology of China (2001BA110B01). Biography: Feng Yu-cai (1945-), male, Professor; research direction: database systems.
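To illustrate the ordering idea, here is a toy sketch under assumed dimension names and cardinalities, not the CFD algorithm itself: dimensions are visited in descending cardinality, and any dimension functionally determined (directly or transitively) by one already chosen is skipped, since partitioning on its determinant already refines it.

```python
# Toy dimension-ordering sketch (illustrative names and numbers assumed).
cardinality = {"store": 1000, "city": 80, "region": 8, "product": 500}
fds = {("store", "city"), ("city", "region")}  # functional dependencies X -> Y

def determined(dim, chosen):
    # dim is redundant if some already-chosen dimension determines it,
    # following the FDs transitively
    closure = set(chosen)
    changed = True
    while changed:
        changed = False
        for x, y in fds:
            if x in closure and y not in closure:
                closure.add(y); changed = True
    return dim in closure

order, skipped = [], []
for dim in sorted(cardinality, key=cardinality.get, reverse=True):
    (skipped if determined(dim, order) else order).append(dim)

print(order)    # dimensions that drive partitioning: ['store', 'product']
print(skipped)  # derivable from 'store' partitions: ['city', 'region']
```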
Deep learning has been explored for airfoil performance prediction in recent years. Compared with expensive CFD simulations and wind tunnel experiments, deep learning models, used appropriately, can mitigate much of that expense. Nevertheless, effective training of data-driven deep learning models hinges on the diversity and quantity of the data. In this paper, we present a novel data-augmented Generative Adversarial Network (GAN), daGAN, for rapid and accurate flow field prediction that adapts to tasks with sparse data. The presented approach consists of two modules: a pre-training module and a fine-tuning module. The pre-training module utilizes a conditional GAN (cGAN) to preliminarily estimate the distribution of the training data. In the fine-tuning module, we propose a novel adversarial architecture with two generators, one of which performs a data augmentation operation so that complementary data are adequately incorporated to boost the generalization of the model. We use numerical simulation data to verify the generalization of daGAN on airfoils and flow conditions with sparse training data. The results show that daGAN is a promising tool for rapid and accurate evaluation of detailed flow fields without requiring big training data. Foundation item: supported by the funding of the Key Laboratory of Aerodynamic Noise Control (No. ANCL20190103), the State Key Laboratory of Aerodynamics, China (No. SKLA20180102), the Aeronautical Science Foundation of China (Nos. 2018ZA52002, 2019ZA052011), and the Priority Academic Program Development of Jiangsu Higher Education Institutions, China (PAPD).
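The pre-training stage is a standard conditional-GAN setup: a generator maps a flow-condition vector plus noise to a flow field, and a discriminator judges (condition, field) pairs. The sketch below shows only that skeleton; the flattened fields, dense layers, and dimensions are stand-in assumptions, and the paper's second, data-augmenting generator and the fine-tuning loop are omitted.

```python
# Bare-bones cGAN training step (illustrative assumptions throughout).
import torch
import torch.nn as nn

COND, NOISE, FIELD = 8, 32, 64 * 64   # condition dim, noise dim, flattened field

G = nn.Sequential(nn.Linear(COND + NOISE, 256), nn.ReLU(),
                  nn.Linear(256, FIELD), nn.Tanh())
D = nn.Sequential(nn.Linear(COND + FIELD, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def step(cond, real_field):
    b = cond.size(0)
    z = torch.randn(b, NOISE)
    fake = G(torch.cat([cond, z], 1))
    # discriminator: push real pairs toward 1, generated pairs toward 0
    opt_d.zero_grad()
    loss_d = bce(D(torch.cat([cond, real_field], 1)), torch.ones(b, 1)) + \
             bce(D(torch.cat([cond, fake.detach()], 1)), torch.zeros(b, 1))
    loss_d.backward(); opt_d.step()
    # generator: fool the discriminator
    opt_g.zero_grad()
    loss_g = bce(D(torch.cat([cond, fake], 1)), torch.ones(b, 1))
    loss_g.backward(); opt_g.step()

# one step on random stand-in data
step(torch.randn(16, COND), torch.rand(16, FIELD) * 2 - 1)
```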
Various uncertainties arising during the acquisition of geoscience data may produce anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With the sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because outliers distort the statistics of geotechnical parameters and data sparsity adds statistical uncertainty to those estimates. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and flags as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique, and it rationally accounts for statistical uncertainty through Bayesian machine learning. Moreover, the proposed approach provides a dedicated method to determine the outlying components of each outlier. The approach is illustrated and verified using simulated and real-life datasets. The results show that it properly identifies, in a probabilistic manner, outliers among sparse multivariate data and their corresponding outlying components, and that it can significantly reduce the masking effect (i.e., missing actual outliers because outliers and statistical uncertainty distort the estimated statistics). It is also found that outliers among sparse multivariate data significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
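A minimal sketch of the Mahalanobis-distance core of such a screen is below. The paper turns the distance into an outlying probability via re-sampling and Bayesian machine learning; here only the distance and a simple leave-one-out resampling proxy (which already reduces masking) are shown, with the synthetic data, planted outlier, and chi-square cutoff as assumptions.

```python
# Mahalanobis screening with leave-one-out statistics (illustrative only).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[0] = [6.0, -5.0, 7.0]            # planted outlier

# score each point against statistics estimated WITHOUT it, so the point
# cannot distort ("mask") the mean and covariance used to judge it
scores = np.empty(len(X))
for i in range(len(X)):
    rest = np.delete(X, i, axis=0)
    mu = rest.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(rest, rowvar=False))
    d = X[i] - mu
    scores[i] = d @ cov_inv @ d    # squared Mahalanobis distance

# a single chi-square cutoff stands in for the paper's outlying probability
outliers = np.where(scores > chi2.ppf(0.975, df=X.shape[1]))[0]
print(outliers)                    # expect index 0 among the flagged points
```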
This paper discusses three issues of statistical inference for a common risk ratio (RR) in sparse meta-analysis data. First, the conventional log-risk-ratio estimator encounters a number of problems when the number of events in the experimental or control group of a 2 × 2 table is zero. An adjusted log-risk-ratio estimator is proposed, with continuity-correction points based on the minimum Bayes risk with respect to the uniform prior density over (0, 1) and the Euclidean loss function. Second, the optimal weights of the pooled estimate are sought that minimize its mean square error (MSE), subject to the constraint that the weights sum to one. Finally, the performance of this minimum-MSE weighted estimator, adjusted with various continuity-correction points, is compared with other popular estimators, such as the Mantel-Haenszel (MH) estimator and the weighted least squares (WLS) estimator (equivalently, the inverse-variance weighted estimator), in terms of point estimation and hypothesis testing via simulation studies. The results illustrate that, regardless of the true value of RR, the MH estimator performs best, with the smallest MSE, when the number of studies is large and the sample sizes within each study are small. The MSEs of the WLS estimator and the proposed weighted estimator under the various corrections are close together and are the smallest when the sample sizes are moderate to large while the number of studies is rather small.
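For the zero-cell problem, the standard remedy is a continuity correction: a small constant is added to the counts so the log-risk ratio stays finite. The sketch below uses one common form with the classic c = 0.5; the paper instead derives the correction points from a minimum-Bayes-risk argument, and correction conventions vary, so treat this as illustrative only.

```python
# Adjusted log-risk ratio for a sparse 2x2 table (one common convention).
import math

def adjusted_log_rr(a, n1, b, n2, c=0.5):
    """a/n1 events in the experimental group, b/n2 in the control group;
    c is the continuity-correction constant added to every cell."""
    return math.log(((a + c) / (n1 + c)) / ((b + c) / (n2 + c)))

# finite even though the experimental group has zero events
print(adjusted_log_rr(0, 20, 3, 25))
```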
We propose a novel filter for sparse big data, called an integrated autoencoder (IAE), which utilises auxiliary information to mitigate data sparsity. The proposed model achieves an appropriate balance between prediction accuracy, convergence speed, and complexity. We conduct experiments on a GPS trajectory dataset, and the results demonstrate that the IAE is more accurate and robust than several state-of-the-art methods. Foundation item: supported by the National Social Science Foundation of China [No. 16FJY008], the National Planning Office of Philosophy and Social Science [No. 11801060], and the Natural Science Foundation of Shandong Province [No. ZR2016FM26].
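The abstract does not spell out the architecture, but one common way to let an autoencoder exploit auxiliary information is to concatenate the side features to both the encoder input and the latent code, so reconstruction of sparse records can borrow from the auxiliary signal. The sketch below shows that generic pattern under assumed dimensions and random stand-in data; it is not the IAE itself.

```python
# Generic auxiliary-information autoencoder (dimensions and data assumed).
import torch
import torch.nn as nn

torch.manual_seed(0)
MAIN, AUX, CODE = 50, 6, 10

enc = nn.Sequential(nn.Linear(MAIN + AUX, 64), nn.ReLU(), nn.Linear(64, CODE))
dec = nn.Sequential(nn.Linear(CODE + AUX, 64), nn.ReLU(), nn.Linear(64, MAIN))

opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
x = torch.rand(128, MAIN) * (torch.rand(128, MAIN) > 0.9).float()  # ~90% zeros
aux = torch.rand(128, AUX)
mask = (x != 0).float()            # score reconstruction on observed cells only

for _ in range(200):
    opt.zero_grad()
    code = enc(torch.cat([x, aux], 1))
    recon = dec(torch.cat([code, aux], 1))
    loss = ((recon - x) ** 2 * mask).sum() / mask.sum()
    loss.backward(); opt.step()
```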
Sparse and irregular climate observations in many developing countries are not sufficient to support the assessment of climate change risks and the planning of suitable mitigation strategies. The widely used statistical downscaling model (SDSM) software tools apply multi-linear regression to extract linear relations between large-scale and local climate variables and then produce high-resolution climate maps from sparse climate observations. The latest machine learning techniques (e.g., SRCNN, SRGAN) can extract nonlinear links, but they are only suitable for downscaling low-resolution grid data and cannot exploit links to other climate variables to improve downscaling performance. In this study, we propose a novel hybrid RBF (Radial Basis Function) network that embeds several RBF networks into new RBF networks. Our model incorporates climate and topographical variables of different resolutions and extracts their nonlinear relations for spatial downscaling. To test the performance of our model, we generated high-resolution precipitation, air temperature, and humidity maps from 34 meteorological stations in Bangladesh. In terms of three statistical indicators, the high-resolution climate maps generated by our hybrid RBF network were clearly more accurate than those produced by multi-linear regression (MLR), Kriging interpolation, or a pure RBF network. Foundation item: supported by the European Commission's Horizon 2020 Framework Program (no. 861584) and the Taishan distinguished professorship fund.
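The building block is easy to show in isolation. Below is a minimal single RBF network fitted by least squares to station data and evaluated on a dense grid to produce a high-resolution map; the paper's hybrid model nests several such networks and mixes covariates of different resolutions, which is omitted here. Coordinates, values, the Gaussian width, and the choice of centers are all assumptions.

```python
# Single Gaussian-RBF interpolation of station data (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
stations = rng.uniform(0, 1, size=(34, 2))             # (lon, lat), rescaled
values = np.sin(3 * stations[:, 0]) + stations[:, 1]   # stand-in climate variable

centers, gamma = stations, 10.0                        # centers at stations (assumed)

def phi(X, C, gamma):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)                         # Gaussian basis functions

# least-squares fit of the basis weights to the observed station values
w, *_ = np.linalg.lstsq(phi(stations, centers, gamma), values, rcond=None)

# evaluate on a high-resolution grid to produce the downscaled map
gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = np.column_stack([gx.ravel(), gy.ravel()])
hi_res = (phi(grid, centers, gamma) @ w).reshape(100, 100)
print(hi_res.shape)                                    # (100, 100) map
```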
Click-through-rate (CTR) prediction is a crucial task in recommendation systems. The accuracy of CTR prediction is strongly influenced by the precise extraction of essential data and the chosen modeling strategy. The data of the CTR task are often very sparse, and Factorization Machines (FMs) are a class of general predictors that work effectively with such data. However, the performance of FMs can be limited by their fixed feature representation and by assigning the same weight to a feature in every context. In this work, we propose an improved Bitwise Feature Importance Factorization Machine (BFIFM) to improve accuracy. A low-order interaction method learns how strongly the same feature acts in different situations, and a deep neural network (DNN) is used in parallel to learn high-order interactions. According to the final results, the BFIFM model significantly outperforms other state-of-the-art models. Foundation item: supported by the Hainan Province Science and Technology Special Fund project "Research and Application of Intelligent Recommendation Technology Based on Knowledge Graph and User Portrait" (No. ZDYF2020039).
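For reference, the plain second-order FM that BFIFM builds on scores an input with a global bias, linear weights, and pairwise terms factorized through latent vectors, computed with the standard O(kn) identity: sum_{i&lt;j} &lt;v_i, v_j&gt; x_i x_j = 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]. The sketch below shows that baseline only; the bitwise feature-importance weights and the parallel DNN of BFIFM are omitted, and all parameters are random stand-ins.

```python
# Plain second-order factorization machine scorer (random parameters assumed).
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 4                        # number of features, latent dimension
w0, w = 0.1, rng.normal(size=n)    # global bias and linear weights
V = rng.normal(size=(n, k))        # latent factor matrix, one row per feature

def fm_score(x):
    s = V.T @ x                                   # sum_i v_i * x_i, per factor
    pair = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pair

x = np.zeros(n); x[[1, 5]] = 1.0   # a sparse, one-hot-style input
print(fm_score(x))
```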
Data sparseness, the evident characteristic of short text, has always been regarded as the main cause of the low accuracy of short text classification using statistical methods. Intensive research has been conducted in this area during the past decade. However, most researchers failed to notice that ignoring the semantic importance of certain feature terms may also contribute to low classification accuracy. In this paper we present a new method to tackle the problem by building a strong feature thesaurus (SFT) based on latent Dirichlet allocation (LDA) and information gain (IG) models. By giving larger weights to the feature terms in the SFT, classification accuracy can be improved. Our method proved especially effective for fine-grained classification. Experiments on two short text datasets demonstrate that our approach achieves an improvement over state-of-the-art methods, including support vector machines (SVM) and multinomial Naive Bayes. Foundation item: Project (No. 20111081023) supported by the Tsinghua University Initiative Scientific Research Program, China.
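The IG half of such a thesaurus can be sketched compactly: terms whose presence best separates the classes receive the largest gain and would get the larger weights. The toy corpus, tokenization, and threshold below are assumptions, and the LDA half (topic-based semantic features) is omitted.

```python
# Information-gain term scoring over a toy labeled corpus (illustrative only).
import math
from collections import Counter

docs = [("buy cheap pills", "spam"), ("meeting at noon", "ham"),
        ("cheap meds now", "spam"), ("lunch meeting today", "ham")]

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [y for _, y in docs]
H = entropy(labels)                 # class entropy before seeing any term

def info_gain(term):
    with_t = [y for d, y in docs if term in d.split()]
    without = [y for d, y in docs if term not in d.split()]
    n = len(docs)
    cond = (len(with_t) / n) * entropy(with_t) + (len(without) / n) * entropy(without)
    return H - cond                 # reduction in class uncertainty

vocab = {w for d, _ in docs for w in d.split()}
sft = {t: info_gain(t) for t in vocab if info_gain(t) > 0.5}  # threshold assumed
print(sorted(sft.items(), key=lambda kv: -kv[1]))  # e.g. 'cheap', 'meeting'
```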