The main goal of this research is to assess the impact of race, age at diagnosis, sex, and phenotype on the incidence and survivability of acute lymphocytic leukemia (ALL) among patients in the United States. By taking these factors into account, the study aims to explore how existing cancer registry data can aid in the early detection and effective treatment of ALL in patients. Our hypothesis was that statistically significant correlations exist between the race, age at diagnosis, sex, and phenotype of ALL patients and their rates of incidence and survivability. Data were evaluated using SEER*Stat statistical software from the National Cancer Institute. Analysis of the incidence data revealed that ALL was more prevalent among the Caucasian population. The majority of ALL cases (59%) occurred in patients aged 0 to 19 years at the time of diagnosis, and 56% of the affected individuals were male. The B-cell phenotype was predominantly associated with ALL cases (73%). When analyzing survivability data, it was observed that the 5-year survival rates slightly exceeded the 10-year survival rates for the respective demographics. Survivability rates of African American patients were the lowest compared with those of Caucasian, Asian, Pacific Islander, Alaskan Native, Native American, and other patients. Survivability rates progressively decreased for older patients. Moreover, this study investigated the typical treatment methods applied to ALL patients, mainly comprising chemotherapy, with occasional supplementation of radiation therapy as required. The study demonstrated the considerable efficacy of chemotherapy in enhancing patients’ chances of survival, while those who remained untreated faced a less favorable prognosis from the disease.
Although a significant amount of data and information exists, this study can help doctors in the future diagnose patients with certain characteristics. It will further assist health care professionals in screening potential patients and detecting cases early. This could also save the lives of elderly patients, who have a higher mortality rate from this disease.
Recently, much interest has been given to multi-granulation rough sets (MGRS), and various types of MGRS models have been developed from different viewpoints. In this paper, we introduce two techniques for the classification of MGRS. Firstly, we generate multi-topologies from multi-relations defined in the universe. Hence, a novel approximation space is established by leveraging the underlying topological structure. The characteristics of the newly proposed approximation space are discussed. We introduce an algorithm for the reduction of multi-relations. Secondly, a new approach for the classification of MGRS based on neighborhood concepts is introduced. Finally, a real-life application from medical records is introduced via our approach to the classification of MGRS.
To improve question answering (QA) performance based on real-world web data sets, a new set of question classes and a general answer re-ranking model are defined. With a pre-defined dictionary and grammatical analysis, the question classifier draws both semantic and grammatical information into information retrieval and machine learning methods in the form of various training features, including the question word, the main verb of the question, the dependency structure, the position of the main auxiliary verb, the main noun of the question, the top hypernym of the main noun, etc. Then the QA query results are re-ranked by question class information. Experiments show that the questions in real-world web data sets can be accurately classified by the classifier, and that the QA results are clearly improved after re-ranking. This shows that, with both semantic and grammatical information, applications such as QA built upon real-world web data sets can achieve better performance.
For satellite remote sensing data obtained in the visible and infrared bands, cloud coverage over the ocean often results in large-scale missing data in inversion products, and thin clouds that are difficult to detect cause the data of the inversion products to be abnormal. Alvera et al. (2005) proposed a method for the reconstruction of missing data based on an Empirical Orthogonal Functions (EOF) decomposition, but their method could not process images with extreme cloud coverage (more than 95%) and required a long time for reconstruction. Besides, the abnormal data in the images had a great effect on the reconstruction result. Therefore, this paper tries to improve on that result. It reconstructs missing data sets by applying the EOF decomposition method twice. Firstly, the times at which abnormal data occur are detected by analyzing the temporal modes of the EOF decomposition, and the abnormal data are eliminated. Secondly, the data sets, excluding the abnormal data, are analyzed using EOF decomposition, and the temporal modes then undergo a filtering process so as to enhance the ability of EOF to reconstruct images that contain little or no data. Finally, this method has been applied to a large data set, i.e., 43 Sea Surface Temperature (SST) satellite images of the Changjiang River (Yangtze River) estuary and its adjacent areas, and the total reconstruction root mean square error (RMSE) is 0.82 °C. The results show that this improved EOF reconstruction method is robust for reconstructing missing and unreliable satellite data.
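The core idea behind EOF-based gap filling can be sketched in a few lines of NumPy. This is only an illustration of the general iterative truncated-SVD technique, not the two-pass method of the abstract: the function name `eof_fill`, the number of retained modes, and the iteration count are assumptions made for the example, and the abnormal-data detection and temporal-mode filtering steps are omitted.

```python
import numpy as np

def eof_fill(data, n_modes=2, n_iter=50):
    """Fill NaN gaps in a (time x space) matrix by iterated truncated SVD."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # First guess: replace every gap with the mean of the observed values.
    filled = np.where(missing, np.nanmean(data), data)
    for _ in range(n_iter):
        # Remove the overall mean so the SVD modes describe anomalies (EOFs).
        mean = filled.mean()
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Rebuild the field from the leading modes only.
        recon = (u[:, :n_modes] * s[:n_modes]) @ vt[:n_modes] + mean
        # Observed values are kept; only the gaps are updated.
        filled[missing] = recon[missing]
    return filled

# Hypothetical toy field with two missing pixels.
truth = np.outer(np.arange(1.0, 7.0), np.arange(1.0, 5.0))
gappy = truth.copy()
gappy[1, 2] = np.nan
gappy[4, 0] = np.nan
restored = eof_fill(gappy, n_modes=2, n_iter=20)
```

In practice the number of modes would be chosen by cross-validation against withheld observations rather than fixed in advance.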
In this paper, we consider the problem of the evaluation of system reliability using statistical data obtained from reliability tests of its elements, in which the lifetimes of elements are described using an exponential distribution. We assume that this lifetime data may be reported imprecisely and that this lack of precision may be described using fuzzy sets. As the direct application of the fuzzy sets methodology leads in this case to very complicated and time consuming calculations, we propose simple approximations of fuzzy numbers using shadowed sets introduced by Pedrycz (1998). The proposed methodology may be simply extended to the case of general lifetime probability distributions.
A novel binary particle swarm optimization for frequent item set mining from high-dimensional data sets (BPSO-HD) was proposed, combining two improvements. Firstly, dimensionality reduction of the initial particles was designed to ensure a reasonable initial fitness; then, dynamic dimensionality cutting of the data set was built to decrease the search space. Based on four high-dimensional data sets, BPSO-HD was compared with Apriori to test its reliability, and with the ordinary BPSO and quantum swarm evolutionary (QSE) algorithms to prove its advantages. The experiments show that the results given by BPSO-HD are reliable and better than those generated by BPSO and QSE.
This paper presents a generalized method for updating approximations of a concept incrementally, which can be used as an effective tool to deal with dynamic attribute generalization. By combining this method and the LERS inductive learning algorithm, it also introduces a generalized quasi incremental algorithm for learning classification rules from data bases.
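For readers unfamiliar with the terminology, the approximations of a concept can be illustrated with the standard Pawlak definitions. The sketch below recomputes them from scratch when an attribute is added, so it is not the incremental update method of the abstract; the function name `approximations` and the toy table are hypothetical.

```python
from collections import defaultdict

def approximations(table, attributes, concept):
    """Return the (lower, upper) approximations of `concept`.

    table: dict mapping object id -> dict of attribute values
    attributes: attribute names that define indiscernibility
    concept: set of object ids to approximate
    """
    # Group objects that agree on every chosen attribute.
    classes = defaultdict(set)
    for obj, values in table.items():
        classes[tuple(values[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper

# Hypothetical toy data: adding attribute "b" refines the partition.
table = {
    1: {"a": 0, "b": 0},
    2: {"a": 0, "b": 1},
    3: {"a": 1, "b": 0},
    4: {"a": 1, "b": 0},
}
concept = {1, 3, 4}
coarse = approximations(table, ["a"], concept)
fine = approximations(table, ["a", "b"], concept)
```

With attribute "a" alone the concept is only roughly described (the lower and upper approximations differ); once "b" is added, the two approximations coincide and the concept becomes exactly definable.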
To evaluate the influence of data set noise, the network in network (NIN) model is introduced and the negative effects of different types and proportions of noise on deep convolutional models are studied. Different types and proportions of data noise are added to two reference data sets, Cifar-10 and Cifar-100. Then, the data containing noise are used to train deep convolutional models and classify the validation data set. The experimental results show that noise in the data set has obvious adverse effects on deep convolutional network classification models. The adverse effects of random noise are small, but cross-category noise can significantly reduce the recognition ability of the model. Therefore, a solution is proposed to improve the quality of data sets that are mixed with a single noise category. The model trained with a data set containing noise is used to evaluate the current training data and reclassify the categories of the anomalies to form a new data set. Repeating the above steps can greatly reduce the noise ratio, so the influence of cross-category noise can be effectively avoided.
Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high volumes of data. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm by overcoming the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris data set (from the University of California, Irvine, Machine Learning Repository), along with the k-means and threshold-based clustering algorithms. The proposed method results in more segregated data sets with compact clusters, thus achieving higher validity indices. The method also eliminates the limitations of the threshold-based clustering algorithm, and its validity measures and respective indices are compared with those of the k-means and threshold-based clustering algorithms.
The Chaoshan depression, a Mesozoic basin in the Dongsha sea area, northern South China Sea, is characterized by well-preserved Mesozoic strata that provide good conditions for oil-gas preservation and promise good prospects for oil-gas exploration. However, a breakthrough in oil-gas exploration in the Mesozoic strata has not been achieved owing to the scarcity of seismic surveys. New long-offset seismic data, acquired on a dense grid with a single source and a single cable, were processed. In addition, the data were processed with a 3D imaging method, and finer processing was performed to highlight the target strata. Combining the new imaging result with other geological information, we conducted an integrated interpretation and proposed an exploratory well, A-1-1, for potential hydrocarbon. The result provides a reliable basis for achieving breakthroughs in oil and gas exploration in the Mesozoic strata in the northern South China Sea.
Data mining (also known as Knowledge Discovery in Databases, KDD) is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The aims and objectives of data mining are to discover knowledge of interest to user needs. Data mining is a useful tool in many domains such as marketing, decision making, etc. However, some basic issues of data mining are ignored. What is data mining? What is the product of a data mining process? What are we doing in a data mining process? Is there any rule we should obey in a data mining process? In order to discover patterns and knowledge that are really interesting and actionable in the real world, Zhang et al. proposed a domain-driven human-machine-cooperated data mining process. Zhao and Yao proposed an interactive user-driven classification method using the granule network. In our work, we find that data mining is a kind of knowledge transforming process that transforms knowledge from data format into symbol format. Thus, no new knowledge is generated in a data mining process. Knowledge is just transformed from data format, which is not understandable to humans, into symbol format, which is understandable to humans and easy to use. It is similar to the process of translating a book from Chinese into English. In this translating process, the knowledge itself in the book should remain unchanged. Only the format of the knowledge changes.
That is, the knowledge in the English book should be kept the same as the knowledge in the Chinese one. Otherwise, there must be some mistakes in the translating process. In the same way, in a data mining process we transform knowledge from one format into another without producing new knowledge. The knowledge is originally stored in data (data is a representation format of knowledge). Unfortunately, we cannot read, understand, or use it, since we cannot understand data. With this understanding of data mining, we proposed a data-driven knowledge acquisition method based on rough sets. It also improved the performance of classical knowledge acquisition methods. In fact, we also find that domain-driven data mining and user-driven data mining do not conflict with our data-driven data mining. They could be integrated into domain-oriented data-driven data mining. It is just like the views of a database. Users with different views could look at different partial data of a database. Thus, users with different tasks or objectives wish to, or could, discover different knowledge (partial knowledge) from the same database. However, all this partial knowledge should originally exist in the database. So, a domain-oriented data-driven data mining method would help us to extract the knowledge that really exists in a database and is really interesting and actionable in the real world.
A novel genetic programming (GP) technique, a new member of the family of evolutionary algorithms, has been applied to predict the response of the water storage of the Wolonghu wetland, in the northeastern part of China, to climate change using a small data set. Fourteen years (1993-2006) of annual water storage and climatic data for the wetland were used for model training and testing. The results of simulations and predictions showed a good fit between calculated and observed water storage values (MAPE=9.47, r=0.99). For comparison, a multilayer perceptron (MLP), a popular artificial neural network model, and a grey model (GM) were applied to the same data set to assess performance. The GP technique performed better than the other two methods in both the simulation and prediction phases, and the results were analyzed and discussed. The case study confirmed that the GP method is a promising way for wetland managers to make a quick estimate of fluctuations in water storage in wetlands for which only a small data set is available.
In this paper, we build a remote-sensing satellite imagery priori-information data set and propose an approach to evaluate the robustness of remote-sensing image feature detectors. The constructed TH Priori-Information (TPI) data set, with 2297 remote-sensing images, serves as a standardized high-resolution data set for studies related to remote-sensing image features. The TPI contains 1) raw and calibrated remote-sensing images with high spatial and temporal resolutions (up to 2 m and 7 days, respectively), and 2) a built-in 3-D target area model that supports view position, view angle, lighting, shadowing, and other transformations. Based on TPI, we further present a quantitative approach, including the feature recurrence rate, the feature match score, and the weighted feature robustness score, to evaluate the robustness of remote-sensing image feature detectors. The quantitative approach gives general and objective assessments of the robustness of feature detectors under complex remote-sensing circumstances. Three remote-sensing image feature detectors, namely scale-invariant feature transform (SIFT), speeded up robust features (SURF), and priori information based robust features (PIRF), are evaluated using the proposed approach on the TPI data set. Experimental results show that the robustness of PIRF exceeds that of the others by over 6.2%.
In gene prediction, Fisher discriminant analysis (FDA) is used to separate protein coding regions (exons) from non-coding regions (introns). Usually, the positive data set and the negative data set are of the same size if enough data are available. But in situations where the data are insufficient or the two sets are unequal in size, the threshold used in FDA may have an important influence on the prediction results. This paper presents a study on the selection of the threshold. The eigenvalue of each exon/intron sequence is computed using the Z-curve method with 69 variables. The experimental results suggest that the size and the standard deviation of the data sets and the threshold are the three key elements to be taken into consideration to improve the prediction results.
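The role of the threshold can be illustrated with a generic two-class Fisher discriminant. This is a minimal sketch on hypothetical two-dimensional toy data, not the 69-variable Z-curve features of the study; the function names `fisher_direction` and `classify` and both threshold values are assumptions made for the example.

```python
import numpy as np

def fisher_direction(pos, neg):
    """Fisher discriminant direction for two classes (rows are samples)."""
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices.
    s_w = (pos - mu_p).T @ (pos - mu_p) + (neg - mu_n).T @ (neg - mu_n)
    return np.linalg.solve(s_w, mu_p - mu_n)

def classify(x, w, threshold):
    """Label samples positive when their projection exceeds the threshold."""
    return x @ w > threshold

# Hypothetical toy samples standing in for coding and non-coding sequences.
pos = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [3.0, 2.0]])
neg = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

w = fisher_direction(pos, neg)
# A common default: the midpoint between the projected class means.
midpoint = 0.5 * (pos.mean(axis=0) @ w + neg.mean(axis=0) @ w)
hits_midpoint = int(classify(pos, w, midpoint).sum())
# Shifting the threshold changes how many positive samples are recovered.
hits_shifted = int(classify(pos, w, 4.5).sum())
```

On this toy data the midpoint threshold recovers all four positive samples, while the shifted threshold misses one, which is the kind of sensitivity the abstract attributes to unequal or small data sets.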
The Arctic region is experiencing strong warming and related changes in the state of sea ice, permafrost, tundra, the marine environment and terrestrial ecosystems. These changes are found in any climatological data set comprising the Arctic region. This study compares the temperature trends in several surface, satellite and reanalysis data sets. We demonstrate large differences in the 1979-2002 temperature trends. Data sets disagree on the magnitude of the trends as well as on their seasonal, zonal and vertical pattern. It was found that the surface temperature trends are stronger than the trends in the tropospheric temperature for each latitude band north of 50°N for each month except for the months during the ice-melting season. These results emphasize that the conclusions of climate studies drawn on the basis of a single data set analysis should be treated with caution as they may be affected by artificial biases in the data.
Insulation data mining fault diagnosis for power transformers based on rough set (RS) and radial basis function neural network (RBFNN) is proposed. On the one hand, the rough set is used as a front end to the RBFNN to simplify its input and mine the rules. The mined rules whose “confidence” and “support” are higher than required are used to offer fault diagnosis service for the power transformer directly. On the other hand, the mining samples corresponding to mined rules whose “confidence” and “support” are lower than required are used as the training sample set of the RBFNN, and these samples are clustered by rough set. The center of each cluster is used as the center of a radial basis function, i.e., as a hidden layer neuron. The RBFNN built on this basis is used to diagnose the cases that cannot be diagnosed by the simplified valuable rules mined with rough set. The advantages and effectiveness of this method are verified by testing.
Rough set theory is a relatively new area of soft computing for handling uncertain big data efficiently. It also provides a powerful way to calculate the importance degree of vague and uncertain big data to help in decision making. Risk assessment is very important for safe and reliable investment. Risk management involves assessing the risk sources and designing strategies and procedures to mitigate those risks to an acceptable level. In this paper, we emphasize the classification of different types of risk factors and find a simple and effective way to calculate the risk exposure. The study uses the rough set method to classify and judge the safety attributes related to investment policy. The method, which is based on intelligent knowledge acquisition, provides an innovative way for risk analysis. From this approach, we are able to calculate the significance of each factor and the relative risk exposure based on the original data without assigning the weights subjectively.
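One common way to obtain attribute significance without subjective weights is the rough set dependency degree: the drop in the fraction of consistently classified objects when an attribute is removed. The sketch below uses that textbook definition, which may differ from the exact measure used in the study; the function names and the toy risk table are hypothetical.

```python
from collections import defaultdict

def partition(table, attributes):
    """Split object ids into blocks that agree on all given attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())

def dependency(table, conditions, decision):
    """Fraction of objects whose decision is determined by `conditions`."""
    decision_classes = partition(table, [decision])
    positive = 0
    for block in partition(table, conditions):
        # A block counts only if all its objects share one decision value.
        if any(block <= d for d in decision_classes):
            positive += len(block)
    return positive / len(table)

def significance(table, conditions, decision, attribute):
    """Loss of dependency when `attribute` is dropped from `conditions`."""
    rest = [a for a in conditions if a != attribute]
    return dependency(table, conditions, decision) - dependency(table, rest, decision)

# Hypothetical investment records: two risk factors and a risk level.
table = {
    1: {"market": "high", "credit": "good", "risk": "high"},
    2: {"market": "high", "credit": "bad", "risk": "high"},
    3: {"market": "low", "credit": "good", "risk": "low"},
    4: {"market": "low", "credit": "bad", "risk": "low"},
}
factors = ["market", "credit"]
sig_market = significance(table, factors, "risk", "market")
sig_credit = significance(table, factors, "risk", "credit")
```

In this toy table the risk level follows the market factor exactly, so removing "market" destroys all consistency while removing "credit" changes nothing.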
Funding (question answering study): Microsoft Research Asia Internet Services in Academic Research Fund (No. FY07-RES-OPP-116); the Science and Technology Development Program of Tianjin (No. 06YFGZGX05900).
Funding (EOF reconstruction study): The National Natural Science Foundation of China under contract Nos. 40576080 and 40506036; the National "863" Project of China under contract No. 2007AA12Z182.
Funding (data set noise study): The Science and Technology R&D Fund Project of Shenzhen (No. JCYJ2017081765149850).
文摘To evaluate the influence of data set noise, the network in network(NIN) model is introduced and the negative effects of different types and proportions of noise on deep convolutional models are studied. Different types and proportions of data noise are added to two reference data sets, Cifar-10 and Cifar-100. Then, this data containing noise is used to train deep convolutional models and classify the validation data set. The experimental results show that the noise in the data set has obvious adverse effects on deep convolutional network classification models. The adverse effects of random noise are small, but the cross-category noise among categories can significantly reduce the recognition ability of the model. Therefore, a solution is proposed to improve the quality of the data sets that are mixed into a single noise category. The model trained with a data set containing noise is used to evaluate the current training data and reclassify the categories of the anomalies to form a new data set. Repeating the above steps can greatly reduce the noise ratio, so the influence of cross-category noise can be effectively avoided.
Abstract: Raw data are classified using clustering techniques in a reasonable manner to create disjoint clusters. Many clustering algorithms based on specific parameters have been proposed to handle high volumes of data. This paper focuses on cluster analysis based on neutrosophic set implication, i.e., a k-means algorithm combined with a threshold-based clustering technique. This algorithm addresses the shortcomings of the k-means clustering algorithm and overcomes the limitations of the threshold-based clustering algorithm. To evaluate the validity of the proposed method, several validity measures and validity indices are applied to the Iris dataset (from the University of California, Irvine, Machine Learning Repository), alongside the k-means and threshold-based clustering algorithms. The proposed method yields better-separated, more compact clusters and thus achieves higher validity indices.
Funding: Supported by the Key Special Project for Introduced Talents Team of Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) (No. GML2019ZD0208), the National Natural Science Foundation of China (No. 41606030), the Science and Technology Program of Guangzhou (No. 202102080363), and the China Geological Survey projects (Nos. DD20190212, DD20190216).
Abstract: The Chaoshan depression, a Mesozoic basin in the Dongsha sea area of the northern South China Sea, is characterized by well-preserved Mesozoic strata, which offer good conditions for oil and gas preservation and promising prospects for exploration. However, no breakthrough in oil and gas exploration in the Mesozoic strata has been achieved, because few seismic surveys have been conducted. New long-offset seismic data, acquired on a dense grid with a single source and a single cable, were processed. In addition, the data were processed with a 3D imaging method, and finer processing was performed to highlight the target strata. Combining the new imaging result with other geological information, we conducted an integrated interpretation and proposed an exploratory well, A-1-1, for potential hydrocarbons. The result provides a reliable basis for achieving breakthroughs in oil and gas exploration in the Mesozoic strata in the northern South China Sea.
Abstract: Data mining (also known as Knowledge Discovery in Databases, KDD) is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Its aim is to discover knowledge of interest to the user. Data mining is a useful tool in many domains such as marketing and decision making. However, some basic questions about data mining are ignored. What is data mining? What is the product of a data mining process? What are we doing in a data mining process? Is there any rule we should obey in a data mining process? In order to discover patterns and knowledge that are really interesting and actionable in the real world, Zhang et al. proposed a domain-driven, human-machine-cooperated data mining process. Zhao and Yao proposed an interactive user-driven classification method using the granule network. In our work, we find that data mining is a knowledge-transforming process that transforms knowledge from data format into symbol format. Thus, no new knowledge can be generated in a data mining process. Knowledge is simply transformed from data format, which humans cannot understand, into symbol format, which humans can understand and easily use. It is similar to translating a book from Chinese into English. In this translation, the knowledge in the book should remain unchanged; only the format of the knowledge changes. That is, the knowledge in the English book should be the same as the knowledge in the Chinese one. Otherwise, there must be some mistake in the translation process. In the same way, a data mining process transforms knowledge from one format into another without producing new knowledge. The knowledge is originally stored in data (data is a representation format of knowledge). Unfortunately, we cannot read, understand, or use it, since we cannot understand data.
With this understanding of data mining, we proposed a data-driven knowledge acquisition method based on rough sets. It also improved the performance of classical knowledge acquisition methods. In fact, we also find that domain-driven data mining and user-driven data mining do not conflict with our data-driven data mining. They can be integrated into domain-oriented data-driven data mining. This is just like the views of a database: users with different views look at different parts of the data in a database. Thus, users with different tasks or objectives wish to, or can, discover different (partial) knowledge from the same database. However, all of this partial knowledge must already exist in the database. A domain-oriented data-driven data mining method would therefore help us extract knowledge that really exists in a database and is really interesting and actionable in the real world.
Funding: Supported by the National Natural Science Foundation of China (60675039), the National High Technology Research and Development Program of China (863 Program) (2006AA04Z217), and the Hundred Talents Program of the Chinese Academy of Sciences
Funding: Sponsored by the National Basic Research Program of China (Grant No. 2006CB403302), the National Education Ministry Foundation of China (Grant No. 705011), and the National Special Science and Technology Program for Water Pollution Control and Treatment (Grant Nos. 2009ZX07526-006, 2008AX07208-001)
Abstract: A novel genetic programming (GP) technique, a new member of the family of evolutionary algorithms, is applied to predict the response of water storage in the Wolonghu wetland, in northeastern China, to climate change using a small data set. Fourteen years (1993-2006) of annual water storage and climatic data for the wetland were used for model training and testing. The simulations and predictions showed a good fit between calculated and observed water storage (MAPE=9.47, r=0.99). For comparison, a multilayer perceptron (MLP), a popular artificial neural network model, and a grey model (GM) were applied to the same data set to assess performance. GP performed better than the other two methods in both the simulation and prediction phases, and the results were analyzed and discussed. The case study confirmed that GP is a promising way for wetland managers to quickly estimate fluctuations of water storage in wetlands for which little data is available.
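The two goodness-of-fit figures quoted above, MAPE and r, can be computed as in the following minimal sketch. It assumes that MAPE is the mean absolute percentage error expressed in percent and that r is the Pearson correlation coefficient; the abstract does not spell out either definition, and the sample numbers are made up for illustration rather than taken from the Wolonghu data.

```python
import math


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be nonzero)."""
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Illustrative values only, not the wetland measurements.
    observed = [100.0, 200.0, 150.0]
    predicted = [110.0, 180.0, 150.0]
    print(mape(observed, predicted), pearson_r(observed, predicted))
```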
Funding: Supported in part by the National Key Research and Development Program of China under Grant 2018YFF0301205, and in part by the National Natural Science Foundation of China under Grant NSFC 61925105 and Grant 61801260.
Abstract: In this paper, we build a remote-sensing satellite imagery priori-information data set and propose an approach to evaluate the robustness of remote-sensing image feature detectors. The constructed TH Priori-Information (TPI) data set, with 2297 remote-sensing images, serves as a standardized high-resolution data set for studies related to remote-sensing image features. The TPI contains 1) raw and calibrated remote-sensing images with high spatial and temporal resolutions (up to 2 m and 7 days, respectively), and 2) a built-in 3-D target area model that supports view position, view angle, lighting, shadowing, and other transformations. Based on TPI, we further present a quantized approach, including the feature recurrence rate, the feature match score, and the weighted feature robustness score, to evaluate the robustness of remote-sensing image feature detectors. The quantized approach gives general and objective assessments of the robustness of feature detectors under complex remote-sensing circumstances. Three remote-sensing image feature detectors, namely scale-invariant feature transform (SIFT), speeded up robust features (SURF), and priori information based robust features (PIRF), are evaluated using the proposed approach on the TPI data set. Experimental results show that PIRF outperforms the others in robustness by over 6.2%.
Abstract: In gene prediction, Fisher discriminant analysis (FDA) is used to separate protein-coding regions (exons) from non-coding regions (introns). Usually, the positive and negative data sets are of the same size if enough data are available. But in situations where the data are insufficient or the sets are unequal, the threshold used in FDA may have an important influence on the prediction results. This paper presents a study on the selection of the threshold. The eigenvalue of each exon/intron sequence is computed using the Z-curve method with 69 variables. The experimental results suggest that the size and standard deviation of the data sets and the threshold are the three key elements to take into consideration to improve the prediction results.
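To make the role of the threshold concrete, the sketch below compares two common ways of placing a cut on one-dimensional discriminant scores: the midpoint of the class means, and a point equidistant from both means in units of each class's standard deviation. Neither rule is taken from the paper; both, along with the function names and the toy scores, are assumptions used only to show how unequal spreads shift the threshold.

```python
from statistics import mean, pstdev


def midpoint_threshold(neg_scores, pos_scores):
    """Threshold halfway between the two class means."""
    return (mean(neg_scores) + mean(pos_scores)) / 2


def std_weighted_threshold(neg_scores, pos_scores):
    """Threshold c satisfying (c - m_neg) / s_neg == (m_pos - c) / s_pos.

    Hypothetical rule for illustration: the cut moves toward the class
    with the smaller spread.
    """
    m_neg, m_pos = mean(neg_scores), mean(pos_scores)
    s_neg, s_pos = pstdev(neg_scores), pstdev(pos_scores)
    return (m_neg * s_pos + m_pos * s_neg) / (s_neg + s_pos)


if __name__ == "__main__":
    # Toy projected scores, not real exon/intron data.
    introns = [-1.0, 1.0]
    exons = [4.0, 8.0]
    print(midpoint_threshold(introns, exons))
    print(std_weighted_threshold(introns, exons))
```

With equal standard deviations the two rules coincide; they diverge as the spreads become more unequal, which is one way the standard deviation of the data sets can affect the prediction results.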
Abstract: The Arctic region is experiencing strong warming and related changes in the state of sea ice, permafrost, tundra, the marine environment, and terrestrial ecosystems. These changes are found in any climatological data set covering the Arctic region. This study compares the temperature trends in several surface, satellite, and reanalysis data sets. We demonstrate large differences in the 1979-2002 temperature trends. The data sets disagree on the magnitude of the trends as well as on their seasonal, zonal, and vertical patterns. The surface temperature trends are stronger than the trends in tropospheric temperature for each latitude band north of 50°N in every month except those of the ice-melting season. These results emphasize that conclusions of climate studies drawn from the analysis of a single data set should be treated with caution, as they may be affected by artificial biases in the data.
Funding: The National Natural Science Foundation of China (Grant No. 50128706).
Abstract: A fault diagnosis method for power transformers based on insulation data mining with rough sets (RS) and a radial basis function neural network (RBFNN) is proposed. On the one hand, the rough set is used as a front end of the RBFNN to simplify its input and mine rules. Mined rules whose "confidence" and "support" are higher than required are used directly to provide fault diagnosis for power transformers. On the other hand, the samples corresponding to mined rules whose "confidence" and "support" are lower than required are used as the training set of the RBFNN, and these samples are clustered by the rough set. The center of each cluster is used as the center of a radial basis function, i.e., as a hidden-layer neuron. The RBFNN built on this basis is used to diagnose cases that cannot be diagnosed by the simplified rules mined with the rough set. The advantages and effectiveness of this method are verified by testing.
Abstract: Rough set theory is a relatively new area of soft computing for handling uncertain big data efficiently. It also provides a powerful way to calculate the importance degree of vague and uncertain big data to help in decision making. Risk assessment is very important for safe and reliable investment. Risk management involves assessing the risk sources and designing strategies and procedures to mitigate those risks to an acceptable level. In this paper, we focus on the classification of different types of risk factors and find a simple and effective way to calculate the risk exposure. The study uses the rough set method to classify and judge the safety attributes related to investment policy. The method, which is based on intelligent knowledge acquisition, provides an innovative way to perform risk analysis. With this approach, we are able to calculate the significance of each factor and the relative risk exposure from the original data without assigning weights subjectively.
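The idea of deriving each factor's significance from the data alone can be sketched with the standard rough set dependency degree: the share of objects whose condition-attribute values determine the decision unambiguously. An attribute's significance is then the drop in dependency when that attribute is removed. The abstract does not give its exact formula, so this textbook definition, the attribute names, and the toy table are all assumptions for illustration.

```python
from collections import defaultdict


def dependency(rows, cond_attrs, dec_attr):
    """Dependency degree: fraction of rows in the positive region."""
    decisions = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in cond_attrs)
        decisions[key].add(row[dec_attr])
    # A row is in the positive region if its equivalence class
    # maps to exactly one decision value.
    consistent = sum(
        1 for row in rows
        if len(decisions[tuple(row[a] for a in cond_attrs)]) == 1
    )
    return consistent / len(rows)


def significance(rows, cond_attrs, dec_attr, attr):
    """Drop in dependency degree when `attr` is removed."""
    reduced = [a for a in cond_attrs if a != attr]
    return dependency(rows, cond_attrs, dec_attr) - dependency(rows, reduced, dec_attr)


if __name__ == "__main__":
    # Hypothetical investment records, not data from the paper.
    table = [
        {"market": "high", "credit": "good", "risk": "low"},
        {"market": "high", "credit": "bad", "risk": "high"},
        {"market": "low", "credit": "good", "risk": "low"},
        {"market": "low", "credit": "bad", "risk": "high"},
    ]
    for factor in ("market", "credit"):
        print(factor, significance(table, ["market", "credit"], "risk", factor))
```

In this toy table the risk level is fully determined by `credit`, so removing `market` costs nothing while removing `credit` destroys the dependency; no weights are assigned by hand.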