The world of information technology is more than ever being flooded with huge amounts of data, nearly 2.5 quintillion bytes every day. This large stream of data is called big data, and the amount is increasing each day. This research uses a technique called sampling, which selects a representative subset of the data points, manipulates and analyzes this subset to identify patterns and trends in the larger dataset being examined, and finally creates models. Sampling uses a small proportion of the original data for analysis and model training, so it is relatively fast while maintaining data integrity and achieving accurate results. Two deep neural networks, AlexNet and DenseNet, were used in this research to test two sampling techniques, namely sampling with replacement and reservoir sampling. The dataset used for this research was divided into three classes: acceptable, flagged as easy, and flagged as hard. The base models were trained on the whole dataset, whereas the other models were trained on 50% of the original dataset. There were four combinations of model and sampling technique. The F-measure for the base AlexNet model was 0.807, while that for the base DenseNet model was 0.808. Combination 1, the AlexNet model with sampling with replacement, achieved an average F-measure of 0.8852. Combination 3, the AlexNet model with reservoir sampling, had an average F-measure of 0.8545. Combination 2, the DenseNet model with sampling with replacement, achieved an average F-measure of 0.8017. Finally, combination 4, the DenseNet model with reservoir sampling, had an average F-measure of 0.8111. Overall, we conclude that both models trained on a sampled dataset gave equal or better results compared to the base models, which used the whole dataset.
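Both sampling techniques tested above are standard. As a point of reference, here is a minimal Python sketch of the two (uniform sampling with replacement, and Algorithm R reservoir sampling for a single pass over a stream); it illustrates the general techniques, not the paper's exact implementation, and the sample size `k` is a free parameter.

```python
import random

def sample_with_replacement(data, k):
    """Draw k items uniformly at random; duplicates are allowed."""
    return [random.choice(data) for _ in range(k)]

def reservoir_sample(stream, k):
    """Algorithm R: one pass over a stream of unknown length,
    keeping each item with equal probability k/n overall."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # uniform in [0, i]
            if j < k:                 # happens with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

Reservoir sampling is the natural choice when the data arrive as a stream of unknown length, while sampling with replacement requires random access to the full dataset.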
Material identification is critical for understanding the relationship between mechanical properties and the associated mechanical functions. However, material identification is a challenging task, especially when the characteristic of the material is highly nonlinear in nature, as is common in biological tissue. In this work, we identify unknown material properties in continuum solid mechanics via physics-informed neural networks (PINNs). To improve the accuracy and efficiency of PINNs, we develop efficient strategies to nonuniformly sample observational data. We also investigate different approaches to enforce Dirichlet-type boundary conditions (BCs) as soft or hard constraints. Finally, we apply the proposed methods to a diverse set of time-dependent and time-independent solid mechanics examples that span linear elastic and hyperelastic material space. The estimated material parameters achieve relative errors of less than 1%. As such, this work is relevant to diverse applications, including optimizing structural integrity and developing novel materials.
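To make the PINN inverse-problem setup concrete, the following is a minimal PyTorch sketch on a manufactured 1D linear-elastic example (a bar governed by E u''(x) + f = 0 with u(0) = u(1) = 0), not one of the paper's examples: a network for the displacement field and the unknown modulus E are trained jointly, with the Dirichlet BCs enforced as a soft penalty.

```python
import torch

torch.manual_seed(0)
E_true, f = 2.0, 1.0
x_data = torch.rand(50, 1)
u_data = (f / (2 * E_true)) * x_data * (1 - x_data)  # analytic solution of E*u'' + f = 0, u(0)=u(1)=0

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
log_E = torch.nn.Parameter(torch.tensor(0.0))  # unknown material parameter, learned jointly
opt = torch.optim.Adam(list(net.parameters()) + [log_E], lr=1e-3)

x_col = torch.rand(100, 1, requires_grad=True)   # collocation points for the PDE residual
x_bc = torch.tensor([[0.0], [1.0]])              # Dirichlet BCs enforced as a soft penalty

for step in range(5000):
    opt.zero_grad()
    u = net(x_col)
    du = torch.autograd.grad(u, x_col, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x_col, torch.ones_like(du), create_graph=True)[0]
    loss_pde = ((log_E.exp() * d2u + f) ** 2).mean()   # physics residual
    loss_data = ((net(x_data) - u_data) ** 2).mean()   # observational data mismatch
    loss_bc = (net(x_bc) ** 2).mean()                  # soft BC penalty
    (loss_pde + loss_data + loss_bc).backward()
    opt.step()

print("estimated E:", log_E.exp().item())  # should move toward the true value 2.0
```

The hard-constraint alternative investigated in the paper would instead build the BCs into the network output, for example by multiplying it by x(1 - x) so the boundary values are satisfied exactly.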
Identifying rare patterns for medical diagnosis is a challenging task due to heterogeneity and the volume of data. Data summarization can create a concise version of the original data that can be used for effective diagnosis. In this paper, we propose an ensemble summarization method that combines clustering and sampling to create a summary of the original data that ensures the inclusion of rare patterns. To the best of our knowledge, no such technique has been available to augment the performance of anomaly detection techniques and simultaneously increase the efficiency of medical diagnosis. The performance of popular anomaly detection algorithms improves significantly in terms of accuracy and computational complexity when the summaries are used. Therefore, medical diagnosis becomes more effective, and our experimental results show that the combination of the proposed summarization scheme with each underlying algorithm used in this paper outperforms the most popular anomaly detection techniques.
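The abstract does not spell out the summarization scheme, but one simple way to combine clustering and sampling so that rare patterns survive is to sample every cluster with a per-cluster floor. The sketch below assumes k-means and placeholder parameters; it is an illustration of the general idea, not the paper's ensemble method.

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(X, n_clusters=10, frac=0.1, min_per_cluster=5, seed=0):
    """Cluster, then sample each cluster; small clusters keep at least
    min_per_cluster points so rare patterns survive in the summary."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        n = max(min_per_cluster, int(frac * len(idx)))
        keep.append(rng.choice(idx, size=min(n, len(idx)), replace=False))
    return X[np.concatenate(keep)]
```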
Rhododendron is famous for its high ornamental value. However, the genus is taxonomically difficult and the relationships within Rhododendron remain unresolved. In addition, the origin of key morphological characters with high horticultural value needs to be explored. Both problems largely hinder the utilization of germplasm resources. Most studies that attempted to disentangle the phylogeny of Rhododendron used only a few genomic markers and lacked large-scale sampling, resulting in low clade support and contradictory phylogenetic signals. Here, we used restriction-site associated DNA sequencing (RAD-seq) data and morphological traits for 144 species of Rhododendron, representing all subgenera and most sections and subsections of this species-rich genus, to decipher its intricate evolutionary history and reconstruct ancestral states. Our results provide high resolution at the subgenus and section levels of Rhododendron based on RAD-seq data. Both the optimal phylogenetic tree and the split tree recovered five lineages within Rhododendron. Subg. Therorhodion (clade I) formed the basal lineage. Subg. Tsutsusi and Azaleastrum formed clade II and had sister relationships. Clade III included all scaly rhododendron species. Subg. Pentanthera (clade IV) formed a sister group to Subg. Hymenanthes (clade V). The ancestral state reconstruction showed that the Rhododendron ancestor was a deciduous woody plant with terminal inflorescences, ten stamens, leaf blades without scales, and a broadly funnelform corolla of pink or purple color. This study resolves the evolutionary history of Rhododendron with high clade support in a phylogenetic tree constructed from RAD-seq data. It also provides an example of resolving discordant signals in phylogenetic trees and demonstrates the feasibility of applying RAD-seq with large amounts of missing data to decipher intricate evolutionary relationships. Additionally, the reconstructed ancestral states of six important characters provide insights into the innovation of key characters in Rhododendron.
A novel data stream partitioning method is proposed to resolve problems of range-aggregation continuous queries over parallel streams for the power industry. The first step of this method is to sample the data in parallel, implemented as an extended reservoir-sampling algorithm. A skip factor based on the change ratio of data values is introduced to describe the distribution characteristics of the data values adaptively. The second step is to partition the flux of the data streams evenly, implemented with two alternative equal-depth histogram generation algorithms that fit different cases: one for incremental maintenance based on heuristics and the other for periodic updates to generate an approximate partition vector. Experimental results on actual data prove that the method is efficient, practical, and suitable for processing time-varying data streams.
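For the second step, an equal-depth (equal-frequency) histogram places bucket boundaries at quantiles of the observed values, so each bucket carries roughly the same flux. Below is a baseline sketch of computing such an approximate partition vector from a sample; the paper's incremental and periodic maintenance algorithms are not reproduced here.

```python
import numpy as np

def equal_depth_partition(sample, n_parts):
    """Boundaries that split the sampled values into n_parts buckets
    of (approximately) equal frequency -- an approximate partition vector."""
    qs = np.linspace(0, 1, n_parts + 1)[1:-1]      # interior quantiles
    return np.quantile(np.asarray(sample), qs)

values = np.random.default_rng(0).lognormal(size=10_000)
print(equal_depth_partition(values, 4))  # three boundaries -> four equal-depth buckets
```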
In order to improve the precision of super point detection and control measurement resource consumption, this paper proposes a super point detection method based on sampling and data streaming algorithms (SDSD), and proves that only sources or destinations with many flows are likely to be sampled by the SDSD algorithm. The SDSD algorithm uses both an IP table and a flow Bloom filter (BF) data structure to maintain the IP and flow information. The IP table is used to judge whether an IP address has been recorded. If the IP exists, then all its subsequent flows will be recorded into the flow BF; otherwise, the IP flow is sampled. This paper also analyzes the accuracy and memory requirements of the SDSD algorithm, and tests them using the CERNET trace. The theoretical analysis and experimental tests demonstrate that the relative errors of the super points estimated by the SDSD algorithm are mostly less than 5%, whereas the results of other algorithms are about 10%. Because of the BF structure, the SDSD algorithm is also better than previous algorithms in terms of memory consumption.
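As a rough illustration of the IP-table/flow-BF interplay described above, here is a simplified sketch; the paper's sampling probability, estimator, and BF sizing are not reproduced, and `p` and the hash construction below are placeholder choices.

```python
import hashlib
import random

class BloomFilter:
    """Minimal Bloom filter over strings with k hash positions."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def add(self, item):
        """Set the item's bits; return True if the item was not present before."""
        new = False
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            pos = int.from_bytes(h[:8], "big") % self.m
            if not (self.bits[pos // 8] >> (pos % 8)) & 1:
                new = True
                self.bits[pos // 8] |= 1 << (pos % 8)
        return new

def process(packets, p=0.01):
    """packets: iterable of (src_ip, dst_ip) pairs. Sources are admitted to
    the IP table with probability p; flows of admitted sources are counted
    once each via the flow BF, so high-fan-out sources dominate the sample."""
    ip_table, flow_bf, fan_out = set(), BloomFilter(), {}
    for src, dst in packets:
        if src in ip_table:
            if flow_bf.add(f"{src}->{dst}"):          # new flow for this source
                fan_out[src] = fan_out.get(src, 0) + 1
        elif random.random() < p:
            ip_table.add(src)
    return fan_out
```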
Southwest China is one of the three major forest regions in China and plays an important role in carbon sequestration. Accurate estimates of changes in aboveground biomass are critical for understanding forest carbon cycling and promoting climate change mitigation. Southwest China is characterized by complex topographic features and forest canopy structures, complicating methods for mapping aboveground biomass and its dynamics. The integration of continuous Landsat images and national forest inventory data provides an alternative approach for developing a long-term monitoring program of forest aboveground biomass dynamics. This study explores the development of a methodological framework using historical national forest inventory plot data and Landsat TM time-series images. The framework was formulated by comparing two parametric methods, multiple linear regression (MLR) and partial least squares regression (PLSR), and two nonparametric methods, random forest (RF) and gradient boosted regression trees (GBRT), based on state and change models of forest aboveground biomass. The methodological framework mapped Pinus densata aboveground biomass and its changes over time in Shangri-La, Yunnan, China. Landsat images and national forest inventory data were acquired for 1987, 1992, 1997, 2002, and 2007. The results show that: (1) correlation and homogeneity texture measures were able to characterize forest canopy structures, aboveground biomass, and its dynamics; (2) GBRT and RF predicted Pinus densata aboveground biomass and its changes better than PLSR and MLR; (3) GBRT was the most reliable approach for estimating aboveground biomass and its changes; and (4) the aboveground biomass change models showed a promising improvement in prediction accuracy. This study indicates that the combination of GBRT state and change models developed using temporal Landsat and national forest inventory data provides the potential for a long-term program for mapping and monitoring forest aboveground biomass and its changes in Southwest China.
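As an illustration of the kind of state model compared above, a GBRT regressor can be fit to plot-level predictors with scikit-learn. The features and biomass values in the sketch below are synthetic placeholders, not the study's Landsat variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# X: per-plot predictors (e.g. band reflectances, correlation/homogeneity
# texture measures); y: plot aboveground biomass from the inventory.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                           # placeholder predictors
y = X[:, 0] * 20 + X[:, 1] * 5 + rng.normal(size=300)   # placeholder biomass

gbrt = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                 max_depth=3, subsample=0.8, random_state=0)
print(cross_val_score(gbrt, X, y, cv=5, scoring="r2").mean())
```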
Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data, and has many application fields, such as statistical data analysis, pattern recognition, and image processing. We combine a sampling technique with the DBSCAN algorithm to cluster large spatial databases, and two sampling-based DBSCAN (SDBSCAN) algorithms are developed. One algorithm introduces the sampling technique inside DBSCAN, and the other uses a sampling procedure outside DBSCAN. Experimental results demonstrate that our algorithms are effective and efficient in clustering large-scale spatial databases.
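The paper's two SDBSCAN variants are not reproduced here, but a common "sampling outside DBSCAN" baseline is easy to state: cluster a random sample with DBSCAN, then assign every remaining point to the cluster of its nearest sampled core point (or to noise if it lies farther than eps). A hedged scikit-learn sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def sdbscan_outside(X, sample_frac=0.2, eps=0.3, min_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=False)
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X[idx])
    core = db.components_                    # core points found in the sample
    core_labels = db.labels_[db.core_sample_indices_]
    # eps/min_samples must be chosen so the sample actually yields core points
    nn = NearestNeighbors(n_neighbors=1).fit(core)
    dist, j = nn.kneighbors(X)
    labels = core_labels[j.ravel()]
    labels[dist.ravel() > eps] = -1          # too far from any core point -> noise
    return labels
```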
China's continental deposition basins are characterized by complex geological structures and various reservoir lithologies. Therefore, high-precision exploration methods are needed. High-density spatial sampling is a new technology for increasing the accuracy of seismic exploration. We briefly discuss point source and receiver technology, analyze the high-density spatial sampling in situ method, introduce the symmetric sampling principles presented by Gijs J. O. Vermeer, and discuss high-density spatial sampling technology from the point of view of wave field continuity. We emphasize the analysis of high-density spatial sampling characteristics, including the advantages of high-density first breaks for investigating near-surface structure and improving static correction precision, the use of dense receiver spacing at short offsets to increase the effective coverage at shallow depth, and the accuracy of reflection imaging. Coherent noise is not aliased, and noise analysis precision and suppression increase as a result. High-density spatial sampling enhances wave field continuity and the accuracy of various mathematical transforms, which benefits wave field separation. Finally, we point out that the difficult part of high-density spatial sampling technology is the data processing. More research needs to be done on methods for analyzing and processing huge amounts of seismic data.
For imbalanced datasets, the focus of classification is to identify samples of the minority class. The performance of current data mining algorithms is not good enough for processing imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets, generating synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from an overgeneralization problem, and density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN and SMOTE, and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups (core samples, borderline samples, and noise samples), and the noise samples of the minority class are then removed so that more effective samples can be synthesized. In order to make full use of the information in core samples and borderline samples, different strategies are used to over-sample each group. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
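For reference, here is the core SMOTE interpolation step that DSMOTE builds on; the optimized-DBSCAN grouping and per-group strategies are the paper's contribution and are not reproduced.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # a random one of the k neighbors
        lam = rng.random()
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)
```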
The capability of accurately predicting the mineralogical brittleness index (BI) from basic suites of well logs is desirable, as it provides a useful indicator of the fracability of tight formations. Measuring mineralogical components in rocks is expensive and time consuming. However, the basic well log curves are not well correlated with BI, so correlation-based machine-learning methods are not able to derive highly accurate BI predictions from such data. A correlation-free, optimized data-matching algorithm is configured to predict BI on a supervised basis from well log and core data available from two published wells in the Lower Barnett Shale Formation (Texas). This transparent open box (TOB) algorithm matches data records by calculating the sum of squared errors between their variables and selecting the best matches as those with the minimum squared errors. It then applies optimizers to adjust the weights applied to individual variable errors to minimize the root mean square error (RMSE) between calculated and predicted BI. The prediction accuracy achieved by TOB using just five well logs (Gr, ρb, Ns, Rs, Dt) to predict BI depends on the density of data records sampled. At a sampling density of about one sample per 0.5 ft, BI is predicted with RMSE ~0.056 and R^2 ~0.790. At a sampling density of about one sample per 0.1 ft, BI is predicted with RMSE ~0.008 and R^2 ~0.995. Adding a stratigraphic height index as an additional (sixth) input variable improves BI prediction accuracy to RMSE ~0.003 and R^2 ~0.999 for the two wells, with only 1 record in 10,000 yielding a BI prediction error of more than ±0.1. The model has the potential to be applied on an unsupervised basis to predict BI from basic well log data in surrounding wells lacking mineralogical measurements but with similar lithofacies and burial histories. The method could also be extended to predict elastic rock properties and seismic attributes from well and seismic data to improve the precision of brittleness index and fracability mapping spatially.
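Below is a minimal sketch of the data-matching idea behind TOB, with min-max normalized variables and a fixed weight vector; the optimizer that tunes the weights against a tuning subset is omitted, and all names and defaults are placeholders.

```python
import numpy as np

def tob_predict(X_train, y_train, x_new, n_matches=3, w=None):
    """Match x_new to training records by weighted sum of squared errors
    and average the BI of the best matches. Optimizing w against a tuning
    set (as TOB does) is omitted here for brevity."""
    w = np.ones(X_train.shape[1]) if w is None else w
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    Xn = (X_train - lo) / (hi - lo)          # min-max normalize training records
    xn = (x_new - lo) / (hi - lo)
    sse = ((Xn - xn) ** 2 * w).sum(axis=1)   # weighted squared errors per record
    best = np.argsort(sse)[:n_matches]       # records with minimum squared error
    return y_train[best].mean()
```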
Objective To develop methods for determining a suitable sample size for bioequivalence assessment of generic topical ophthalmic drugs using a crossover design with serial sampling schemes. Methods The power functions of the Fieller-type confidence interval and the asymptotic confidence interval in crossover designs with serial-sampling data are derived. Simulation studies were conducted to evaluate the derived power functions. Results Simulation studies show that the two power functions can provide precise power estimates when normality assumptions are satisfied, and yield conservative estimates of power when the data are log-normally distributed. The intra-correlation showed a positive correlation with the power of the bioequivalence test. When the expected ratio of the AUCs was less than or equal to 1, the power of the Fieller-type confidence interval was larger than that of the asymptotic confidence interval. If the expected ratio of the AUCs was larger than 1, the asymptotic confidence interval had greater power. Sample size can be calculated through numerical iteration with the derived power functions. Conclusion The Fieller-type power function and the asymptotic power function can be used to determine sample sizes of crossover trials for bioequivalence assessment of topical ophthalmic drugs.
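For context, the Fieller-type interval builds on the classical Fieller limits for a ratio of means θ = μ₁/μ₂. With point estimates a and b, variance estimates v₁₁ and v₂₂, covariance v₁₂, and critical value t, the limits are real and form a genuine interval only when b² − t²v₂₂ > 0:

```latex
% Classical Fieller limits for the ratio \theta = \mu_1 / \mu_2:
\theta_{L,U}
  = \frac{(ab - t^{2} v_{12}) \pm
          \sqrt{(ab - t^{2} v_{12})^{2}
                - (a^{2} - t^{2} v_{11})(b^{2} - t^{2} v_{22})}}
         {b^{2} - t^{2} v_{22}}
```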
This paper is concerned with a novel Lyapunov-like functional approach to the stability of sampled-data systems with variable sampling periods. The Lyapunov-like functional has four striking characteristics compared to usual ones. First, it is time-dependent. Second, it may be discontinuous. Third, not every term of it is required to be positive definite. Fourth, the Lyapunov functional includes not only the state and the sampled state but also the integral of the state. By using a recently reported inequality to estimate the derivative of this Lyapunov functional, a sampling-interval-dependent stability criterion with reduced conservatism is obtained. The stability criterion is further extended to sampled-data systems with polytopic uncertainties. Finally, three examples are given to illustrate the reduced conservatism of the stability criteria.
The research was carried out on the territory of the Karelian Isthmus of the Leningrad Region using Sentinel-2B images and data from a network of ground sample plots. The ground sample plots are located in the studied territory mainly in a regular pattern, established and surveyed according to the ICP-Forests methodology with some additions. The total area of the sample plots is a small part of the entire study area. One objective of the study was to determine the possibility of using the k-NN (nearest neighbor) method to assess the state of forests throughout the whole studied territory by joint statistical processing of data from ground sample plots and Sentinel-2B imagery. The data from the ground sample plots were divided into two equal parts, one for applying the k-NN method and the second for checking the results of its application. The systematic error in determining the mean damage class of the tree stands on sample plots by the k-NN method turned out to be zero, and the random error is equal to one point. These results offer the possibility of determining the state of the forest in the entire study area. The second objective of the study was to examine the possibility of using the short-wave vegetation index (SWVI) to assess the state of forests. A close, statistically reliable dependence between the average state score of plantations and the value of the SWVI index was established, which makes it possible to use this relationship to determine the state of forests throughout the studied territory. The joint use and statistical processing of remotely sensed data and ground sample plots by the two studied methods make it possible to assess the state of forests throughout the large studied area within the image. The results obtained can be used to monitor the state of forests in large areas and to design appropriate forest protection measures.
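Below is a hedged scikit-learn sketch of the k-NN estimation step with the study's half-and-half split design; the features and the damage-class response are synthetic placeholders, not the actual Sentinel-2B variables.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# X: per-plot spectral features from imagery (e.g. band reflectances, SWVI);
# y: mean damage class scored on the ground sample plots.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                              # placeholder features
y = np.clip(2 + X[:, 0] + 0.3 * rng.normal(size=200), 1, 5)

# Half the plots estimate, the other half verify, as in the study design.
X_fit, X_chk, y_fit, y_chk = train_test_split(X, y, test_size=0.5, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_fit, y_fit)
err = knn.predict(X_chk) - y_chk
print("systematic error (bias):", err.mean(), "random error (RMSE):", np.sqrt((err ** 2).mean()))
```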
The Fourier transform is a basis of the analysis. This paper presents a method for determining the profile of an inverted object in inverse scattering from minimum sampled data.
In this paper, consensus problems of heterogeneous multi-agent systems based on sampled data with a small sampling delay are considered. First, a consensus protocol based on sampled data with a small sampling delay for heterogeneous multi-agent systems is proposed. Then, algebraic graph theory, the matrix method, the stability theory of linear systems, and some other techniques are employed to derive necessary and sufficient conditions guaranteeing that heterogeneous multi-agent systems asymptotically achieve stationary consensus. Finally, simulations are performed to demonstrate the correctness of the theoretical results.
A new three-parameter discrete distribution called the zero-inflated cosine geometric (ZICG) distribution is proposed for the first time herein. It can be used to analyze over-dispersed count data with excess zeros. The basic statistical properties of the new distribution, such as the moment generating function, mean, and variance, are presented. Furthermore, confidence intervals are constructed by using the Wald, Bayesian, and highest posterior density (HPD) methods to estimate the true confidence intervals for the parameters of the ZICG distribution. Their efficacies were investigated using both simulation and real-world data comprising the number of daily COVID-19 positive cases at the Tokyo 2020 Olympic Games. The results show that the HPD interval performed better than the other methods in terms of coverage probability and average length in most cases studied.
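The ZICG probability mass function itself is defined in the paper; structurally, any zero-inflated count model mixes a point mass at zero (with inflation weight π) and a baseline count pmf f(x; θ), here the cosine geometric:

```latex
% Generic zero-inflated structure with inflation weight \pi and
% baseline count pmf f(x;\boldsymbol{\theta}):
P(X = 0) = \pi + (1 - \pi)\, f(0;\boldsymbol{\theta}), \qquad
P(X = x) = (1 - \pi)\, f(x;\boldsymbol{\theta}), \quad x = 1, 2, \ldots
```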
The potential of citizen science projects in research has been increasingly acknowledged, but substantial engagement with these projects is restricted by the quality of citizen science data. Based on the largest emerging citizen science project in China, the Birdreport Online Database (BOD), we examined the biases of birdwatching data from the Greater Bay Area of China. The results show that the sampling effort is disparate among land cover types due to contributors' preference for urban and suburban areas, indicating that environments suitable for species existence could be underrepresented in the BOD data. We tested the contributors' species identification skill via a questionnaire targeting citizen birders in the Greater Bay Area. The questionnaire shows that most citizen birdwatchers could correctly identify the common species widely distributed in Southern China and the less common species with conspicuous morphological characteristics, while failing to identify species from the Alaudidae, Caprimulgidae, Emberizidae, Phylloscopidae, Scolopacidae, and Scotocercidae. With a study example, we demonstrate that spatially clustered birdwatching visits can cause underestimation of species richness in insufficiently sampled areas, and that the result of species richness mapping is sensitive to the contributors' skill in identifying bird species. Our results address how avian research can be influenced by the reliability of citizen science data in a region of generally high accessibility, and highlight the necessity of pre-analysis scrutiny of data reliability with regard to research aims at all spatial and temporal scales. To improve data quality, we suggest equipping the BOD data collection frame with a flexible filter for bird abundance, and questionnaires that collect information related to contributors' bird identification skill. Statistical modelling approaches are encouraged for correcting the bias in sampling effort.