Connes' distance formula is applied to endow a linear metric on three 1D lattices of different topologies, using a generalization of the lattice Dirac operator written down by Dimakis et al. to contain a non-unitary link variable. The geometric interpretation of this link variable is as lattice spacing and parallel transport.
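For reference, a minimal statement of Connes' spectral distance formula, which the abstract applies to the lattice setting (this is the standard form from noncommutative geometry; the notation is ours, not the paper's):

```latex
d(x, y) = \sup_{f \in \mathcal{A}} \left\{\, |f(x) - f(y)| \;:\; \left\| [D, f] \right\| \le 1 \,\right\}
```

where \(\mathcal{A}\) is the algebra of functions on the lattice and \(D\) is the Dirac operator; on a lattice, the generalized Dirac operator with its link variable sets the scale of \(d\) between neighboring sites.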
Data processing of small samples is an important and valuable research problem in electronic equipment testing. Because it is difficult and complex to determine the probability distribution of small samples, it is difficult to use traditional probability theory to process the samples and assess the degree of uncertainty. Using grey relational theory and norm theory, this article proposes the grey distance information approach, which is based on the grey distance information quantity of a sample and the average grey distance information quantity of the samples. The definitions of the grey distance information quantity of a sample and the average grey distance information quantity of the samples, with their characteristics and algorithms, are introduced. The correlative problems, including the algorithm for the estimated value, the standard deviation, and the acceptance and rejection criteria for the samples and estimated results, are also addressed. Moreover, the information whitening ratio is introduced to select the weight algorithm and to compare the different samples. Several examples are given to demonstrate the application of the proposed approach. The examples show that the proposed approach, which makes no demand on the probability distribution of small samples, is feasible and effective.
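The paper's grey distance information quantities are its own constructions, but they build on standard grey relational theory. As a hedged illustration only, a minimal sketch of Deng's classical grey relational coefficient from that theory (the function name and the per-pair min/max simplification are ours, not the paper's):

```python
import numpy as np

def grey_relational_coefficients(reference, sample, rho=0.5):
    """Deng's grey relational coefficients between a reference sequence and
    one comparison sequence; rho is the distinguishing coefficient,
    conventionally 0.5. Values near 1 indicate close agreement."""
    delta = np.abs(np.asarray(reference, float) - np.asarray(sample, float))
    d_min, d_max = delta.min(), delta.max()
    return (d_min + rho * d_max) / (delta + rho * d_max)

print(grey_relational_coefficients([1.0, 2.0, 3.0], [1.1, 2.5, 2.9]))
```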
Machine Learning (ML) systems often involve a re-training process to make better predictions and classifications. This re-training process creates a loophole and poses a security threat for ML systems. Adversaries leverage this loophole and design data poisoning attacks against ML systems. Data poisoning attacks are a type of attack in which an adversary manipulates the training dataset to degrade the ML system's performance. Data poisoning attacks are challenging to detect, and even more difficult to respond to, particularly in the Internet of Things (IoT) environment. To address this problem, we propose DISTINIT, the first proactive data poisoning attack detection framework using distance measures. We found that the Jaccard Distance (JD) can be used in DISTINIT (among other distance measures), and we further improved the JD to obtain an Optimized JD (OJD) with lower time and space complexity. Our security analysis shows that DISTINIT is secure against data poisoning attacks when key features of adversarial attacks are considered. We conclude that the proposed OJD-based DISTINIT is effective and efficient against data poisoning attacks where in-time detection is critical for IoT applications with large volumes of streaming data.
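The OJD optimization is not spelled out in the abstract; as a point of reference, a minimal sketch of the plain Jaccard distance it starts from (function name and example sets are ours):

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# e.g. feature sets extracted from two successive training batches
print(jaccard_distance({"f1", "f2", "f3"}, {"f2", "f3", "f4"}))  # 0.5
```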
Learning unlabeled data is a significant challenge that requires handling complicated relationships between nominal values and attributes. Increasingly, recent research on learning value relations within and between attributes has shown significant improvement in clustering, outlier detection, etc. However, typical existing work relies on learning pairwise value relations but weakens or overlooks the direct couplings between multiple attributes. This paper thus proposes two novel and flexible multi-attribute couplings-based distance (MCD) metrics, which learn the multi-attribute couplings and their strengths in nominal data based on information theory: self-information, entropy, and mutual information, for measuring both numerical and nominal distances. MCD enables the application of numerical and nominal clustering methods to nominal data and quantifies the influence of involving and filtering multi-attribute couplings on distance learning and clustering performance. Substantial experiments support these conclusions on 15 datasets against seven state-of-the-art distance measures with various feature selection methods for both numerical and nominal clustering.
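The MCD metrics themselves are the paper's contribution, but the information-theoretic quantities they build on are standard. A minimal sketch of empirical mutual information between two nominal attributes (a plug-in estimator in nats; names are ours):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats between two nominal attributes,
    computed from empirical joint and marginal frequencies."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

print(mutual_information(["a", "a", "b", "b"], ["u", "u", "v", "v"]))  # ~0.693 (= ln 2)
```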
The 1970–1985 daily averaged pressure dataset of Shanghai and the extension method in phase space are used to calculate the correlation dimension D and the second-order Renyi entropy K2, an approximation of the Kolmogorov entropy. The fractional dimension D = 7.7–7.9 and the positive value K2 ≈ 0.1 are obtained. This shows that the attractor for short-term weather evolution in the monsoon region of China exhibits chaotic motion. The estimate of K2 yields a predictable time scale of about ten days. This result agrees with that obtained earlier by the dynamic-statistical approach. The effects of the lag time τ on the estimates of D and K2 are investigated. The results show that D and K2 are convergent with respect to τ. The daily averaged pressure series used in this paper are treated for the extended phase space with τ = 5, and the coordinate components are independent of each other; therefore, the dynamical character quantities of the system are stable and reliable.
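For context, a minimal sketch of the Grassberger–Procaccia correlation integral that underlies correlation dimension estimates (the standard method; the paper's exact estimator may differ):

```python
import numpy as np

def correlation_integral(points, r):
    """Correlation integral C(r): the fraction of point pairs in the
    reconstructed phase space closer than r. The correlation dimension D
    is the slope of log C(r) versus log r in the scaling region."""
    points = np.asarray(points, float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    pairs = np.sum(dists[np.triu_indices(n, k=1)] < r)
    return 2.0 * pairs / (n * (n - 1))
```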
Based on the framework of evidence theory, data fusion aims at obtaining a single Basic Probability Assignment (BPA) function by combining several belief functions from distinct information sources. Dempster's rule of combination is the most popular combination rule, but it manages conflict between information sources poorly at the normalization step. Even when facing highly conflicting information, the classical Dempster-Shafer (D-S) evidence theory can yield counter-intuitive results. This paper presents a modified averaging method to combine conflicting evidence based on the distance between evidences, and also gives the weighted average of the evidence in the system. Numerical examples show that the proposed method realizes the intended modification and provides reasonable results with good convergence efficiency.
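For reference, a minimal sketch of the baseline Dempster rule of combination that the paper modifies (the dict-of-frozensets representation is ours; the paper's distance-weighted averaging step is not shown):

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two BPAs given as dicts mapping frozenset focal
    elements to masses; the conflicting mass is removed by renormalization."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

m1 = {frozenset("A"): 0.9, frozenset("AB"): 0.1}
m2 = {frozenset("B"): 0.9, frozenset("AB"): 0.1}
print(dempster_combine(m1, m2))  # high conflict (0.81) is renormalized away
```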
A physical retrieval approach based on the one-dimensional variational (1D-Var) algorithm is applied in this paper to simultaneously retrieve atmospheric temperature and humidity profiles under both clear-sky and partly cloudy conditions from FY-4A GIIRS (Geostationary Interferometric Infrared Sounder) observations. Radiosonde observations from upper-air stations in China and level-2 operational products from the Chinese National Satellite Meteorological Center (NSMC) during the periods from December 2019 to January 2020 (winter) and from July 2020 to August 2020 (summer) are used to validate the accuracies of the retrieved temperature and humidity profiles. Comparing the 1D-Var-retrieved profiles to radiosonde data, the accuracy of the temperature retrievals at each vertical level of the troposphere is characterized by a root mean square error (RMSE) within 2 K, except at the bottom level of the atmosphere under clear conditions. The RMSE increases slightly for the higher atmospheric layers, owing to the lack of temperature sounding channels there. Under partly cloudy conditions, the temperature at each vertical level can be obtained, while the level-2 operational products provide values only at altitudes above the cloud top. In addition, the accuracy of the retrieved temperature profiles is greatly improved compared with that of the operational products. For the humidity retrievals, the mean RMSEs in the troposphere in winter and summer are both within 2 g kg⁻¹. Moreover, the retrievals perform better than the ERA5 reanalysis data between 800 hPa and 300 hPa in both summer and winter in terms of RMSE.
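The abstract does not restate the cost function, but 1D-Var retrievals conventionally minimize the following (textbook form and notation, not necessarily the paper's):

```latex
J(\mathbf{x}) = \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathrm{T}}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
              + \tfrac{1}{2}\,[\mathbf{y}-H(\mathbf{x})]^{\mathrm{T}}\mathbf{R}^{-1}[\mathbf{y}-H(\mathbf{x})]
```

where x is the atmospheric state (temperature and humidity profile), x_b the background state, B and R the background and observation error covariance matrices, y the observed radiances, and H the forward radiative transfer operator.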
Knowledge of soil permeability is essential for evaluating hydrologic characteristics of soil, such as water storage and water movement, and the soil permeability coefficient is an important parameter that reflects soil permeability. In order to confirm the acceptability of the one-dimensional horizontal infiltration method (one-D method) for simultaneously determining both the saturated and unsaturated permeability coefficients of loamy sand, we first measured the cumulative infiltration and the wetting front distance under various infiltration heads through a series of one-dimensional horizontal infiltration experiments, and then analyzed the relationships of the cumulative horizontal infiltration with the wetting front distance and the square root of infiltration time. We finally compared the permeability results from the Gardner model based on the one-D method with the results from two other commonly used methods (the constant head method and the van Genuchten model) to evaluate the acceptability and applicability of the one-D method. The results showed a robust linear relationship between the cumulative horizontal infiltration and the wetting front distance, suggesting that it is more appropriate to take the soil moisture content after infiltration in the entire wetted zone as the average soil moisture content than as the saturated soil moisture content. The results also showed a robust linear relationship between the cumulative horizontal infiltration and the square root of infiltration time, suggesting that the Philip infiltration formula can well reflect the characteristics of cumulative horizontal infiltration under different infiltration heads. Two facts indicate that it is feasible to use the one-D method for simultaneously determining the saturated and unsaturated permeability coefficients of loamy sand. First, the saturated permeability coefficient (prescribed in the Gardner model) of loamy sand obtained from the one-D method agreed well with the value obtained from the constant head method. Second, the relationship between the unsaturated permeability coefficient and soil water suction for loamy sand calculated using the Gardner model based on the one-D method was nearly identical to the same relationship calculated using the van Genuchten model.
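For reference, the standard forms that the abstract's linear relationships rest on, assuming the usual parameterizations (the paper's exact fitting forms may differ): Philip's formula for horizontal infiltration, where the gravity term vanishes and only the sorptivity term remains, and Gardner's exponential conductivity model,

```latex
I(t) = S\,t^{1/2}, \qquad K(s) = K_s\, e^{-\alpha s}
```

where I is cumulative infiltration, S sorptivity, t time, K_s the saturated permeability coefficient, s the soil water suction, and α a fitting parameter. Linearity of I against t^{1/2} is what the second robust linear relationship in the abstract checks.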
An extension of the 2-D assignment approach is proposed for measurement-to-target association to improve the accuracy of multiple-target vector miss distance measurement. When multiple targets move closely together, the measurements cannot be fully resolved due to finite sensor resolution. The proposed method adopts an auction algorithm to compute a feasible measurement-to-target assignment with unresolved measurements, thereby solving this 2-D assignment problem. Computer simulation results demonstrate the effectiveness and feasibility of the method.
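For orientation, a minimal sketch of Bertsekas' forward auction algorithm for a square assignment problem (the paper's extension to unresolved measurements is not reproduced; names and the epsilon choice are ours):

```python
import numpy as np

def auction_assignment(benefit, eps=1e-3):
    """Forward auction for the square assignment problem: maximizes the total
    benefit[i, j] of a one-to-one pairing. An unassigned bidder bids its best
    object's price up by (best - second_best + eps)."""
    n = benefit.shape[0]
    prices = np.zeros(n)
    owner = [-1] * n              # owner[j] = bidder currently holding object j
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = values.max() if n > 1 else best
        prices[j] += best - second + eps
        if owner[j] != -1:
            unassigned.append(owner[j])   # previous holder is outbid
        owner[j] = i
    return owner

print(auction_assignment(np.array([[3.0, 1.0], [1.0, 2.0]])))  # [0, 1]
```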
Seismic waveform clustering is a useful technique for lithologic identification and reservoir characterization. Current seismic waveform clustering algorithms are predominantly based on a fixed time window, which is applicable for layers of stable thickness. When a layer exhibits variable thickness in the seismic response, a fixed time window cannot provide comprehensive geologic information for the target interval. Therefore, we propose a novel waveform clustering workflow based on a variable time window to enable broader applications. The dynamic time warping (DTW) distance is first introduced to effectively measure the similarities between seismic waveforms of various lengths. We develop a DTW distance-based clustering algorithm to extract centroids, and we then determine the class of all seismic traces according to their DTW distances from the centroids. To greatly reduce the computational complexity in seismic data applications, we propose a superpixel-based seismic data thinning approach. We further propose an integrated workflow that can be applied to practical seismic data by incorporating the DTW distance-based clustering and seismic data thinning algorithms. We evaluated the performance by applying the proposed workflow to synthetic seismograms and seismic survey data. Compared with the traditional waveform clustering method, the synthetic seismogram results demonstrate the enhanced capability of the proposed workflow to detect boundaries of different lithologies or lithologic associations with variable thickness. Results from a practical application show that the planar map of seismic waveform clustering obtained by the proposed workflow correlates well with the geological characteristics of wells in terms of reservoir thickness.
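A minimal sketch of the classic DTW distance between two waveforms of different lengths (the textbook dynamic program with unit steps; the paper's variant may add windowing or normalization):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D sequences of possibly
    different lengths, using an absolute-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0.0, 1.0, 2.0, 1.0], [0.0, 2.0, 1.0]))  # 1.0
```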
There are numerous application areas for computing the similarity between process models, including finding similar models in a repository, controlling redundancy of process models, and finding corresponding activities between a pair of process models. The similarity between two process models is computed based on the similarity of their labels, structures, and execution behaviors. Several attempts have been made to develop similarity techniques for activity labels as well as for execution behavior. A notable problem, however, is that two process models can also be similar when there is structural variation between them; yet neither a benchmark dataset for structural similarity between process models exists, nor does an effective technique to compute structural similarity. To that end, we have developed a large collection of process models in which structural changes are handcrafted while preserving the semantics of the models. Furthermore, we have used a machine learning-based approach to compute the similarity between a pair of process models having structural and label differences. Finally, we have evaluated the proposed approach using our generated collection of process models.
When building a classification model, the scenario where the samples of one class significantly outnumber those of the other class is called data imbalance. Data imbalance causes the trained classification model to favor the majority class (usually defined as the negative class), which can harm the accuracy of the minority class (usually defined as the positive class) and lead to poor overall performance of the model. This article proposes a method called MSHR-FCSSVM for imbalanced data classification, based on a new hybrid resampling approach (MSHR) and a new fine cost-sensitive support vector machine (CS-SVM) classifier (FCSSVM). The MSHR measures the separability of each negative sample through its Silhouette value, calculated using the Mahalanobis distance between samples; based on this, so-called pseudo-negative samples are screened out to generate new positive samples through linear interpolation (over-sampling step) and are finally deleted (under-sampling step). This approach replaces pseudo-negative samples with generated new positive samples one by one to clear up the inter-class overlap on the borderline, without changing the overall scale of the dataset. The FCSSVM is an improved version of the traditional CS-SVM. It simultaneously considers the influences on classification of both the imbalance in sample numbers and the class distribution, and it finely tunes the class cost weights using an efficient optimization algorithm based on the physical phenomenon of rime ice (RIME), with cross-validation accuracy as the fitness function, to accurately adjust the classification borderline. To verify the effectiveness of the proposed method, a series of experiments is carried out on 20 imbalanced datasets, including both mildly and extremely imbalanced datasets. The experimental results show that the MSHR-FCSSVM method performs better than the comparison methods in most cases, and that both the MSHR and the FCSSVM play significant roles.
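The Mahalanobis distance at the core of the MSHR separability measure is standard; a minimal sketch (the Silhouette screening and resampling steps of MSHR are not shown):

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between two samples under covariance cov; unlike
    Euclidean distance, it accounts for feature scale and correlation."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.random.default_rng(0).normal(size=(100, 3))
print(mahalanobis(X[0], X[1], np.cov(X, rowvar=False)))
```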
Most clustering algorithms need to describe the similarity of objects by a predefined distance function. Three distance functions that are widely used in two traditional clustering algorithms, k-means and hierarchical clustering, were investigated. Both theoretical analysis and detailed experimental results are given. It is shown that the distance function greatly affects clustering results; by comparing the differing results, it can be used to detect the outliers of a cluster and to reveal the shape information of clusters. In practical situations, it is suggested to apply different distance functions separately, compare the clustering results, and pick out the "swing points", as such points may reveal more information for data analysts.
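The abstract does not name the three distance functions; as a hedged illustration, here are three that are commonly paired with k-means and hierarchical clustering (the paper's actual choices may differ):

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Objects that change cluster membership when the metric changes are the
# "swing points" the abstract suggests inspecting.
a, b = np.array([1.0, 0.0]), np.array([10.0, 1.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
```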
Cache performance tuning tools help developers write programs with good locality that fully use the cache, reducing the impact of the speed gap between processor and memory. This paper introduces the design and implementation of a cache performance tuning tool named CTuning, which employs source-level instrumentation to gather program data access information and uses a limited reuse distance model to analyze cache behavior. Experiments on 183.equake improved average performance by more than 6% and show that CTuning is proficient not only at locating cache performance bottlenecks to guide manual code transformation, but also at analyzing cache behavior relationships among variables to direct manual data reorganization.
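A minimal sketch of the reuse distance notion underlying CTuning's model (the tool's limited-distance variant and its instrumentation are not reproduced):

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses touched
    since the previous access to the same address (None on first access).
    Under fully associative LRU, a reuse distance >= the cache capacity in
    blocks predicts a miss."""
    last_pos, result = {}, []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            result.append(len(set(trace[last_pos[addr] + 1 : i])))
        else:
            result.append(None)
        last_pos[addr] = i
    return result

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [None, None, None, 2, 2]
```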
Based on the framework of support vector machines (SVM) using the one-against-one (OAO) strategy, a new multi-class kernel method based on a directed acyclic graph (DAG) and probabilistic distance is proposed to raise multi-class classification accuracy. The topology of the DAG is constructed by rearranging the sequence of nodes in the graph. The DAG is equivalent to operating SVMs guided by a list, and the classification performance depends on the node sequence in the graph. The Jeffries-Matusita distance (JMD) is introduced to estimate the separability of each class, and the implementation list is initialized with all classes organized in a corresponding sequence. To test the effectiveness of the proposed method, numerical analysis is conducted on UCI data and hyperspectral data. Meanwhile, comparative studies using standard OAO and DAG classification methods are also conducted, and the results illustrate the better performance and higher accuracy of the proposed JMD-DAG method.
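For reference, a minimal sketch of the Jeffries-Matusita distance between two Gaussian class models, built on the Bhattacharyya distance (the standard remote-sensing form; the paper's exact variant is not stated in the abstract):

```python
import numpy as np

def jeffries_matusita(mu1, cov1, mu2, cov2):
    """JM distance = 2 * (1 - exp(-B)), with B the Bhattacharyya distance
    between two Gaussians; ranges from 0 (identical) to 2 (fully separable)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov = (np.asarray(cov1, float) + np.asarray(cov2, float)) / 2.0
    diff = mu1 - mu2
    b = diff @ np.linalg.inv(cov) @ diff / 8.0 + 0.5 * np.log(
        np.linalg.det(cov) / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
    )
    return 2.0 * (1.0 - np.exp(-b))

print(jeffries_matusita([0, 0], np.eye(2), [3, 0], np.eye(2)))  # ~1.35, well separated
```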
We measure the distance to the supernova remnant G15.4+0.1, which is likely associated with the TeV source HESS J1818-154. We build the neutral hydrogen (HI) absorption and 13CO spectra for supernova remnant G15.4+0.1 by employing data from the Southern Galactic Plane Survey (SGPS) and the HI/OH/Recombination line survey (THOR). The maximum absorption velocity of about 140 km s⁻¹ constrains the lower limit of its distance to about 8.0 kpc. Further, the fact that the HI emission feature at about 95 km s⁻¹ seems to have no corresponding absorption suggests that G15.4+0.1 likely has an upper distance limit of about 10.5 kpc. The 13CO spectrum for the remnant supports our measurement. The new distance provides revised parameters for its associated pulsar wind nebula and TeV source.
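For context, the standard kinematic-distance relations behind such HI absorption limits, assuming circular Galactic rotation near the plane (textbook form, not quoted from the paper):

```latex
v_{\mathrm{LSR}} = R_0 \sin l \left( \frac{V(R)}{R} - \frac{V_0}{R_0} \right),
\qquad
d = R_0 \cos l \pm \sqrt{R^2 - R_0^2 \sin^2 l}
```

where l is Galactic longitude, R_0 and V_0 the solar Galactocentric radius and orbital speed, and V(R) the rotation curve; absorption up to the maximum observed velocity places the source at or beyond the corresponding near distance, which is how the 140 km s⁻¹ feature yields the 8.0 kpc lower limit.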
In recent years, industrial and service organizations have invested in improvement projects aimed at increasing process performance in the production of goods and services, applying techniques that optimize production time, minimize the restrictive effects of funds tied up in processing, and reduce losses of general scope. This paper discusses the impact of applying business intelligence (BI) concepts to the production records of the publishing department of a higher education institution (HEI) that provides distance education (DE) in Brazil. BI supports the department's management decisions, with the goal of minimizing the production time of learning materials through more effective control: cubes provide reports for querying delivery-delay metrics and the financial value of production tasks, and filter the processed information so that managers can view it from various angles and managerial perspectives. The objectives of this paper are to demonstrate the impact of using BI concepts in the processes of an editorial department of an HEI focused on developing teaching materials for DE courses, and to identify the financial cost-benefit ratio for the HEI of deploying BI on a software-fee platform in its publishing department. The department is responsible for courseware publishing, an area whose organizations usually lack systems with this emphasis: support for managerial decision-making.
Detecting the boundaries of protein domains is an important and challenging task in both experimental and computational structural biology. In this paper, a promising method for detecting the domain structure of a protein from sequence information alone is presented. The method is based on analyzing multiple sequence alignments derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and they are then combined into a single predictor using a support vector machine. More importantly, domain detection is here first treated as an imbalanced data learning problem, and a novel undersampling method based on distance-based maximal entropy in the feature space of the Support Vector Machine (SVM) is proposed. The overall precision is about 80%. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems on general imbalanced datasets.
Compositional data, such as relative information, is a crucial aspect of machine learning and other related fields. It is typically recorded as closed data, i.e., data that sums to a constant such as 100%. The statistical linear model is the most used technique for identifying hidden relationships between underlying random variables of interest. However, data quality is a significant challenge in machine learning, especially when missing data are present. The linear regression model is a commonly used statistical modeling technique for finding relationships between variables of interest in various applications. When estimating linear regression parameters, which are useful for tasks like future prediction and partial-effects analysis of independent variables, maximum likelihood estimation (MLE) is the method of choice. However, many datasets contain missing observations, which can lead to costly and time-consuming data recovery. To address this issue, the expectation-maximization (EM) algorithm has been suggested for situations involving missing data. The EM algorithm iteratively finds the best estimates of parameters in statistical models that depend on unobserved variables or data, via maximum likelihood or maximum a posteriori (MAP) estimation. Using the current estimate as input, the expectation (E) step constructs the expected log-likelihood function; finding the parameters that maximize it, as determined in the E step, is the job of the maximization (M) step. This study examined how well the EM algorithm works on a simulated compositional dataset with missing observations, using both robust least squares and ordinary least squares regression. The efficacy of the EM algorithm was compared with two alternative imputation techniques, k-nearest neighbor (k-NN) and mean imputation, in terms of Aitchison distances and covariance.
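The Aitchison distance used for evaluation has a simple closed form: the Euclidean distance between centred log-ratio (clr) transforms of the compositions. A minimal sketch (function names are ours; parts must be strictly positive):

```python
import numpy as np

def aitchison_distance(x, y):
    """Aitchison distance between two compositions: close each vector to a
    constant sum, apply the centred log-ratio transform, then take the
    Euclidean distance."""
    def clr(v):
        v = np.asarray(v, float)
        v = v / v.sum()          # closure to a constant sum
        logv = np.log(v)
        return logv - logv.mean()
    return float(np.linalg.norm(clr(x) - clr(y)))

print(aitchison_distance([0.2, 0.3, 0.5], [0.1, 0.4, 0.5]))
```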