Abstract: Machine learning is becoming increasingly important in scientific and technological progress, due to its ability to create models that describe complex data and generalize well. The wealth of publicly available seismic data nowadays requires automated, fast, and reliable tools to carry out a multitude of tasks, such as the detection of small, local earthquakes in areas characterized by a sparsity of receivers. Such an application of machine learning, however, must be built on a large amount of labeled seismograms, which are neither immediate to obtain nor to compile. In this study we present a large dataset of seismograms recorded along the vertical, north, and east components of 1487 broad-band or very broad-band receivers distributed worldwide; this includes 629,095 3-component seismograms generated by 304,878 local earthquakes and labeled as EQ, and 615,847 labeled as noise (AN). Application of machine learning to this dataset shows that a simple Convolutional Neural Network of 67,939 parameters can discriminate between earthquake and noise single-station recordings, even when applied in regions not represented in the training set. Achieving accuracies of 96.7%, 95.3%, and 93.2% on the training, validation, and test sets, respectively, we show that the large variety of geological and tectonic settings covered by our data supports the generalization capabilities of the algorithm and makes it applicable to real-time detection of local events. We make the database publicly available, intending to provide the seismological and broader scientific community with a benchmark time-series dataset to be used as a testing ground in signal processing.
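The paper's network architecture is not reproduced in this listing; the sketch below is only a minimal illustration, in PyTorch, of a compact 1-D convolutional classifier that maps a 3-component waveform window to EQ/AN class scores. The window length, filter counts, and kernel sizes are assumptions for illustration and do not correspond to the authors' 67,939-parameter design.

```python
# Minimal, illustrative sketch of a compact 1-D CNN that classifies a
# 3-component seismogram window as earthquake (EQ) or noise (AN).
# Window length, filter counts, and kernel sizes are assumptions,
# NOT the architecture described in the paper.
import torch
import torch.nn as nn

class EqNoiseCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=7, padding=3),  # 3 input channels: Z, N, E
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),                     # global pooling -> fixed-size feature
        )
        self.classifier = nn.Linear(32, 2)               # logits for {EQ, AN}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, n_samples), demeaned/normalized waveforms
        z = self.features(x).squeeze(-1)
        return self.classifier(z)

if __name__ == "__main__":
    model = EqNoiseCNN()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {n_params}")           # a few thousand for this toy configuration
    logits = model(torch.randn(8, 3, 2400))              # batch of 8 dummy 2400-sample windows
    print(logits.shape)                                  # torch.Size([8, 2])
```

The sketch only fixes the input/output convention (three channels in, two class logits out); an actual model would be trained on the labeled EQ and AN windows of the published dataset.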
Funding: Supported by the National Natural Science Foundation of China (NSFC, Grant No. 71173154), the National Social Science Fund of China (NSSFC, Grant No. 08BZX076), and the Fundamental Research Funds for the Central Universities.
Abstract:
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.
Findings: We found that the performance of the algorithms used varies with the size of the training sample. However, for the classification exercise in this paper the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
Research limitations: The dataset gathered has significantly more records related to the topic of interest than unrelated ones. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which could open the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.
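The study's own implementation (seven classifiers and their voted combinations) is not reproduced here; the following is a minimal scikit-learn sketch of the SVM-plus-Boosting combination and the coverage idea, where coverage is the share of records on which the two classifiers agree and accuracy/recall are then evaluated on that agreed subset. The function name, hyperparameters, TF-IDF features, and the 10% training share are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch: combine an SVM and a boosting classifier on TF-IDF
# features of titles/abstracts; "coverage" is the share of test records on
# which the two classifiers agree, and accuracy/recall are computed on that
# agreed subset. Hyperparameters and the 10% training share are assumptions.
from typing import Sequence
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

def svm_boosting_vote(texts: Sequence[str], labels: Sequence[int],
                      train_share: float = 0.10, seed: int = 0) -> dict:
    """texts: manually coded records; labels: 1 = related to the topic, 0 = unrelated."""
    y = np.asarray(labels)
    X_tr_txt, X_te_txt, y_tr, y_te = train_test_split(
        list(texts), y, train_size=train_share, random_state=seed, stratify=y)

    vec = TfidfVectorizer(stop_words="english")
    X_tr, X_te = vec.fit_transform(X_tr_txt), vec.transform(X_te_txt)

    svm = SVC(kernel="linear").fit(X_tr, y_tr)
    boost = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)

    p_svm, p_boost = svm.predict(X_te), boost.predict(X_te)
    agree = p_svm == p_boost                        # voting: keep only labels both agree on
    return {
        "coverage": float(agree.mean()),            # share of records with agreement
        "accuracy": float(accuracy_score(y_te[agree], p_svm[agree])),
        "recall": float(recall_score(y_te[agree], p_svm[agree])),
    }
```

Records on which the two classifiers disagree (1 − coverage) are the natural candidates for the remaining manual coding.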
Abstract: Despite exploration and production success in the Niger Delta, several failed wells have been encountered due to overpressures. Hence, it is essential to understand the spatial distribution of pore pressure and its generating mechanisms in order to mitigate the pitfalls that might arise during drilling. This research provides estimates of pore pressure along three offshore wells using Eaton's transit-time method, a multi-layer perceptron artificial neural network (MLP-ANN), and random forest regression (RFR) algorithms. Our results show that there are three pressure-magnitude regimes: a normal pressure zone (hydrostatic pressure), a transition pressure zone (slightly above hydrostatic pressure), and an overpressured zone (significantly above hydrostatic pressure). The top of the geopressured zone (2873 mbRT, or 9425.853 ft) on average marks the onset of overpressure, with the excess pore pressure above hydrostatic pressure (P*) varying along the three wells between 1.06 and 24.75 MPa. The results from the three methods are self-consistent, with strong correlation between Eaton's method and the two machine learning models. The models have high accuracy (>97%), low mean absolute percentage error (MAPE < 3%), and a high coefficient of determination (R² > 0.98). Our results also show that the principal generating mechanisms responsible for high pore pressure in the offshore Niger Delta are disequilibrium compaction, unloading (fluid expansion), and shale diagenesis.
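For reference, Eaton's transit-time relation used as the baseline in this study can be written compactly. The sketch below uses Eaton's commonly quoted exponent of 3 for sonic data and purely illustrative numbers; in practice the exponent, the normal-compaction trend, and the units must be calibrated to the local wells and logs.

```python
# Minimal sketch of Eaton's sonic/transit-time relation:
#     Pp = S - (S - Ph) * (dt_normal / dt_observed) ** n
# S  : overburden (vertical) stress, MPa
# Ph : hydrostatic pore pressure, MPa
# dt_normal   : transit time on the normal-compaction trend (e.g. us/ft)
# dt_observed : measured sonic transit time (same units)
# n  : Eaton exponent (commonly 3 for sonic data; calibrated locally)
def eaton_pore_pressure(overburden, hydrostatic, dt_normal, dt_observed, n=3.0):
    # Works with plain floats; it also applies element-wise to NumPy arrays
    # of log samples if arrays are passed in.
    return overburden - (overburden - hydrostatic) * (dt_normal / dt_observed) ** n

# Example with illustrative values: an observed transit time above the normal
# trend (undercompaction) yields a pore pressure above hydrostatic.
Pp = eaton_pore_pressure(overburden=60.0, hydrostatic=30.0,
                         dt_normal=90.0, dt_observed=110.0)
print(f"pore pressure ~ {Pp:.1f} MPa, excess above hydrostatic P* ~ {Pp - 30.0:.1f} MPa")
```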