To address the shortage of public datasets for customs X-ray images of contraband and the difficulties in deploying trained models in engineering applications, a method has been proposed that employs the Extract-Transform-Load (ETL) approach to create an X-ray dataset of contraband items. Initially, X-ray scatter image data are collected and cleaned. Using Kafka message queues and the Elasticsearch (ES) distributed search engine, the data are transmitted in real time to cloud servers. Subsequently, contraband data are annotated using a combination of neural networks and manual methods to improve annotation efficiency, and a mean hash algorithm is implemented for quick image retrieval. A method of integrating targets with backgrounds augments the X-ray contraband image data, increasing the number of positive samples. Finally, an Airport Customs X-ray dataset (ACXray) compatible with customs business scenarios has been constructed, featuring an increased number of positive contraband samples. In experimental tests, the Mask Region-based Convolutional Neural Network (Mask R-CNN) algorithm was trained on three datasets and tested on 400 real customs images: the recognition accuracy of models trained on Security Inspection X-ray (SIXray) and Occluded Prohibited Items X-ray (OPIXray) decreased by 16.3% and 15.1%, respectively, while the accuracy of the model trained on ACXray was almost unaffected. This indicates that the ACXray-trained algorithm possesses strong generalization capability and is better suited to customs detection scenarios.
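A minimal sketch of the mean (average) hash idea the abstract mentions for quick image retrieval: downscale, grayscale, threshold each pixel against the mean, and compare hashes by Hamming distance. The 8x8 size, bit ordering, and distance threshold are illustrative assumptions, not details from the paper.

```python
# Mean (average) hash: near-duplicate images differ in only a few bits.
from PIL import Image

def mean_hash(path: str, size: int = 8) -> int:
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= avg else 0)  # 1 bit per pixel
    return bits

def hamming(h1: int, h2: int) -> int:
    return bin(h1 ^ h2).count("1")

# Usage: hashes within a few bits usually indicate the same or a very similar scan.
# if hamming(mean_hash("a.png"), mean_hash("b.png")) <= 5: ...
```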
For the first time, this article introduces a LiDAR Point Clouds Dataset of Ships composed of both collected and simulated data, addressing the scarcity of LiDAR data in maritime applications. The collected data are acquired using specialized maritime LiDAR sensors in both inland waterways and wide-open ocean environments. The simulated data are generated by placing a ship in the LiDAR coordinate system and scanning it with a redeveloped Blensor that emulates the operation of a LiDAR sensor equipped with various laser beams. Furthermore, we also render point clouds for foggy and rainy weather conditions. To describe a realistic shipping environment, a dynamic tail wave is modeled by iterating the wave elevation of each point in a time series. Finally, networks designed for small objects are migrated to ship applications by training them on our dataset. The positive effect of simulated data is demonstrated in object detection experiments, and the negative impact of tail waves as noise is verified in single-object tracking experiments. The dataset is available at https://github.com/zqy411470859/ship_dataset.
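One way to picture the "iterating the wave elevation of each point in a time series" step is a superposition of directional sinusoids evaluated per point and per time step; this is a hedged sketch, and the amplitudes, wave numbers, and frequencies below are made-up parameters, not the paper's wave model.

```python
import numpy as np

def wave_elevation(xy: np.ndarray, t: float) -> np.ndarray:
    """Toy tail-wave height z(x, y, t): sum of a few directional sinusoids."""
    x, y = xy[:, 0], xy[:, 1]
    # (amplitude m, wave number rad/m, direction rad, angular frequency rad/s)
    components = [(0.15, 0.8, 0.0, 1.2), (0.08, 1.6, 0.5, 2.1), (0.04, 3.0, -0.4, 3.3)]
    z = np.zeros_like(x)
    for a, k, theta, omega in components:
        z += a * np.sin(k * (x * np.cos(theta) + y * np.sin(theta)) - omega * t)
    return z

# Iterate t over a time series and add z to each water-surface point's height
# to emulate a dynamic sea state around the scanned ship.
pts = np.random.uniform(-20, 20, size=(1000, 2))
for t in np.arange(0.0, 2.0, 0.1):
    z = wave_elevation(pts, t)
```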
One of the biggest dangers to society today is terrorism, where attacks have become one of the most significant risks to international peace and national security. Big data, information analysis, and artificial intelligence (AI) have become the basis for making strategic decisions in many sensitive areas, such as fraud detection, risk management, medical diagnosis, and counter-terrorism. However, there is still a need to assess how terrorist attacks are related, initiated, and detected. For this purpose, we propose a novel framework for classifying and predicting terrorist attacks. The proposed framework posits that neglected text attributes included in the Global Terrorism Database (GTD) can influence the accuracy of the model's classification of terrorist attacks, where each part of the data can provide vital information to enrich classifier learning. Each data point in a multiclass taxonomy has one or more tags attached to it, referred to as "related tags." We applied machine learning classifiers to classify terrorist attack incidents obtained from the GTD. A transformer-based technique called DistilBERT extracts and learns contextual features from text attributes to acquire more information from text data. The extracted contextual features are combined with the "key features" of the dataset and used to perform the final classification. The study explored different experimental setups with various classifiers to evaluate the model's performance. The experimental results show that the proposed framework outperforms the latest techniques for classifying terrorist attacks, reaching an accuracy of 98.7% with a combined feature set and an extreme gradient boosting classifier.
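A hedged sketch of the pipeline shape described here: embed free-text attributes with DistilBERT, concatenate the embeddings with tabular "key features", and classify with extreme gradient boosting. The model name, feature handling, and toy inputs are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from xgboost import XGBClassifier

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bert = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    """Return the first-token (CLS-position) embedding per text."""
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state[:, 0]
    return out.numpy()

texts = ["bomb detonated near a market", "armed assault on a convoy"]  # stand-ins for GTD text attributes
key_feats = np.array([[1.0, 3.0], [0.0, 7.0]])   # assumed tabular "key features"
X = np.hstack([embed(texts), key_feats])          # combined feature set
y = np.array([0, 1])                              # assumed attack-type labels
clf = XGBClassifier(n_estimators=200).fit(X, y)
```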
The scientific goal of the Anninghe seismic array is to investigate the detailed geometry of the Anninghe fault and the velocity structure of the fault zone. This 2D seismic array is composed of 161 stations forming a sub-rectangular geometry along the Anninghe fault, covering 50 km and 150 km in the fault-normal and strike directions, respectively, with ~5 km intervals. The data were collected between June 2020 and June 2021, with some temporal gaps. Two types of instruments, QS-05A and SmartSolo, are used in this array. Data quality and examples of seismograms are provided in this paper. After the data protection period ends (expected in June 2024), researchers can request the dataset from the National Earthquake Science Data Center.
El Niño-Southern Oscillation (ENSO), the leading mode of global interannual variability, usually intensifies the Hadley Circulation (HC) while constraining its meridional extension, leading to an equatorward movement of the jet system. Previous studies have investigated the response of the HC to ENSO events using different reanalysis datasets and evaluated their capability in capturing the main features of ENSO-associated HC anomalies. However, these studies mainly focused on the global HC, represented by a zonal-mean mass stream function (MSF). Comparatively fewer studies have evaluated HC responses from a regional perspective, partly because the Stokes MSF formulation prevents us from integrating a regional HC. In this study, we adopt a recently developed technique to construct the three-dimensional structure of the HC and evaluate the capability of eight state-of-the-art reanalyses in reproducing the regional HC response to ENSO events. Results show that all eight reanalyses reproduce the spatial structure of HC responses well, with an intensified HC around the central-eastern Pacific but weakened circulations around the Indo-Pacific warm pool and tropical Atlantic. The spatial correlation coefficient of the three-dimensional HC anomalies among the different datasets is always larger than 0.93. However, these datasets may not capture the amplitudes of the HC responses well. This uncertainty is especially large for ENSO-associated equatorially asymmetric HC anomalies, with the maximum amplitude in the Climate Forecast System Reanalysis (CFSR) being about 2.7 times the minimum value in the Twentieth Century Reanalysis (20CR). One should therefore be careful when using reanalysis data to evaluate the intensity of ENSO-associated HC anomalies.
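For context, the zonal-mean MSF contrasted with the regional view above is conventionally obtained by integrating the zonal-mean meridional wind in pressure coordinates; this is the standard textbook definition, not anything specific to this study:

```latex
\psi(\phi, p) = \frac{2\pi a \cos\phi}{g} \int_{0}^{p} [\bar{v}](\phi, p')\,\mathrm{d}p'
```

Here $a$ is Earth's radius, $g$ gravity, $\phi$ latitude, $p$ pressure, and $[\bar{v}]$ the time- and zonal-mean meridional wind. The zonal averaging $[\cdot]$ is exactly what removes longitudinal information and prevents a regional decomposition, which motivates the three-dimensional reconstruction adopted in the study.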
Gestational Diabetes Mellitus (GDM) is a significant health concern affecting pregnant women worldwide. It is characterized by elevated blood sugar levels during pregnancy and poses risks to both maternal and fetal health. Maternal complications of GDM include an increased risk of developing type 2 diabetes later in life, as well as hypertension and preeclampsia during pregnancy. Fetal complications may include macrosomia (large birth weight), birth injuries, and an increased risk of developing metabolic disorders later in life. Understanding the demographics, risk factors, and biomarkers associated with GDM is crucial for effective management and prevention strategies. This research aims to address these aspects comprehensively through the analysis of a dataset comprising 600 pregnant women. By exploring the demographics of the dataset and employing data modeling techniques, the study seeks to identify key risk factors associated with GDM. Moreover, by analyzing various biomarkers, the research aims to gain insights into the physiological mechanisms underlying GDM and its implications for maternal and fetal health. The significance of this research lies in its potential to inform clinical practice and public health policies related to GDM. By identifying demographic patterns and risk factors, healthcare providers can better tailor screening and intervention strategies for pregnant women at risk of GDM. Additionally, insights into biomarkers associated with GDM may contribute to the development of novel diagnostic tools and therapeutic approaches. Ultimately, by enhancing our understanding of GDM, this research aims to improve maternal and fetal outcomes and reduce the burden of this condition on healthcare systems and society. However, it is important to acknowledge the limitations of the dataset used in this study. Further research utilizing larger and more diverse datasets, and perhaps advanced analysis tools such as Power BI, is warranted to corroborate and expand upon these findings. This underscores the ongoing need for continued investigation into GDM to refine our understanding and improve clinical management strategies.
Automatic pavement crack detection is a critical task for maintaining pavement stability and driving safety. The task is challenging because shadows on the pavement may have intensity similar to cracks, which interferes with crack detection performance. To date, efficient algorithm models and training datasets for handling the interference caused by shadows have been lacking. To fill this gap, we make several contributions. First, we propose a new pavement shadow and crack dataset, which contains a variety of shadow and pavement pixel size combinations. It also covers all common crack types (linear cracks and network cracks), placing higher demands on crack detection methods. Second, we design a two-step shadow-removal-oriented crack detection approach, SROCD, which improves performance by first removing the shadow and then detecting the crack; in addition to shadows, the method can cope with other noise disturbances. Third, we explore the mechanism by which shadows affect crack detection. Based on this mechanism, we propose a data augmentation method based on differences in brightness values, which can adapt to brightness changes caused by seasonal and weather variation. Finally, we introduce a residual feature augmentation algorithm to detect the small cracks that can presage sudden disasters, and the algorithm improves the overall performance of the model. We compare our method with state-of-the-art methods on existing pavement crack datasets and on the shadow-crack dataset, and the experimental results demonstrate the superiority of our method.
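A hedged sketch of a brightness-difference augmentation in the spirit the abstract describes: shift the V channel of an HSV representation to emulate lighting changes from season and weather. The channel choice and offsets are illustrative assumptions, not the paper's exact method.

```python
import cv2
import numpy as np

def brightness_augment(img_bgr: np.ndarray, delta: int) -> np.ndarray:
    """Shift image brightness by `delta` via the HSV value channel."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 2] = np.clip(hsv[..., 2] + delta, 0, 255)  # clamp to valid range
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

# Usage: generate darker and brighter variants of each training image.
# img = cv2.imread("pavement.jpg")
# darker, brighter = brightness_augment(img, -40), brightness_augment(img, 40)
```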
A M_(S)6.8 earthquake occurred on 5 September 2022 in Luding county, Sichuan, China, at 12:52 Beijing Time (4:52 UTC). We compiled a dataset of PGA, PGV, and site v_(S30) from 73 accelerometers and 791 Micro-Electro-Mechanical System (MEMS) sensors within 300 km of the epicenter. The inferred v_(S30) of 820 recording sites was validated. The results show that: (1) the maximum horizontal PGA and PGV reach 634.1 Gal and 71.1 cm/s, respectively; (2) over 80% of the records are from soil sites; (3) the v_(S30) proxy model of Zhou J et al. (2022) is superior to that of Wald and Allen (2007) and performs well in the study area. The dataset was compiled into a flat file containing the information of the strong-motion instruments, the strong-motion records, and the v_(S30) of the recording sites. The dataset is available at https://www.seismisite.net.
Lane change prediction is critical for crash avoidance but challenging, as it requires understanding the instantaneous driving environment. With cutting-edge artificial intelligence and sensing technologies, autonomous vehicles (AVs) are expected to have exceptional perception systems that capture their driving environments instantaneously for predicting lane changes. Exploring the Waymo open motion dataset, this study proposes a framework for mining autonomous driving data and investigating lane change behaviors. Within the framework, this study develops a Long Short-Term Memory (LSTM) model to predict lane changing behaviors. The concept of Vehicle Operating Space (VOS) is introduced to quantify a vehicle's instantaneous driving environment as an important indicator for predicting vehicle lane changes. To examine the robustness of the model, a series of sensitivity analyses are conducted by varying the feature selection, prediction horizon, and training data balancing ratios. The test results show that including VOS in the modeling speeds up loss decay during training and leads to higher accuracy and recall in predicting lane-change behaviors. This study offers an example, along with a methodological framework, for transportation researchers to use emerging autonomous driving data to investigate driving behaviors and traffic environments.
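A minimal sketch of an LSTM lane-change classifier over per-timestep feature vectors (e.g., ego kinematics plus a VOS-style surrounding-space descriptor). The feature dimension, class set, and sequence length are assumptions for illustration, not Waymo specifics or the paper's architecture.

```python
import torch
import torch.nn as nn

class LaneChangeLSTM(nn.Module):
    def __init__(self, n_features: int = 10, hidden: int = 64, n_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)  # keep lane / left / right

    def forward(self, x):                 # x: (batch, time, features)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])           # classify from the final hidden state

model = LaneChangeLSTM()
logits = model(torch.randn(4, 30, 10))    # 4 tracks, 3 s of history at 10 Hz
```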
Ocean salinity is an important variable that affects ocean stratification. We compared the salinity and ocean stratification in the tropical Pacific derived from Argo (Array for Real-time Geostrophic Oceanography) data, EN4 (Ensemble 4 analysis), SODA (the Simple Ocean Data Assimilation reanalysis), IAP (Institute of Atmospheric Physics data), and ORAS4 (Ocean Reanalysis System 4) over 2005–2017. Results show that the spatial distribution of the climatological mean sea surface salinity (SSS) is consistent across all products, while the low-salinity regions show large deviations and strong dispersion. Argo has the smallest RMSE and the highest correlation with the ensemble mean, while IAP shows a high-salinity deviation relative to the other datasets. All products show high positive correlations between the sea surface density (SSD) and SSS deviations of the climatological mean from the ensemble mean, suggesting that the SSD deviation may be mainly influenced by the SSS deviation. Regarding ocean stratification, the climatological mean mixed layer depth (MLD) in Argo shows the highest correlation with the ensemble mean, followed by EN4, IAP, ORAS4, and SODA. Argo and EN4 show a thicker barrier layer (BL) relative to the ensemble mean, while SODA displays the largest negative deviation in the tropical western Pacific. Furthermore, EN4, ORAS4, and IAP underestimate the stability of the upper ocean at depths of 20–140 m, while Argo overestimates it. The salinity fronts in the western-central equatorial Pacific from Argo, EN4, and ORAS4 are consistent, while those from SODA and IAP show large deviations, displaced westward by 0°–6° and 0°–10°, respectively. The SSS trend patterns from all products are consistent with the ensemble mean, with high spatial correlations of 0.95–0.97.
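For readers unfamiliar with the MLD diagnostic compared above, here is a hedged sketch of one common criterion: the shallowest depth where potential density exceeds its near-surface value by a fixed threshold. The 0.03 kg/m^3 threshold and the synthetic profile are illustrative; products and studies differ in the exact choice.

```python
import numpy as np

def mixed_layer_depth(depth: np.ndarray, sigma: np.ndarray, thresh: float = 0.03) -> float:
    """Shallowest depth where potential density exceeds its surface value by `thresh`."""
    deeper = np.where(sigma - sigma[0] > thresh)[0]
    return depth[deeper[0]] if deeper.size else depth[-1]

z = np.arange(0.0, 300.0, 10.0)
sig = 22.0 + 0.004 * z**1.2        # synthetic density profile (kg/m^3 anomaly units)
print(mixed_layer_depth(z, sig))
```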
Data mining and analytics involve inspecting and modeling large pre-existing datasets to discover decision-making information. Precision agriculture uses data mining to advance agricultural development. Many farmers are not getting the most out of their land because they do not use precision agriculture; they harvest crops without a well-planned recommendation system. Future crop production is estimated by combining environmental conditions and management behavior, yielding numerical and categorical data. Most existing research has yet to address data preprocessing and crop categorization/classification. Furthermore, statistical analysis receives less attention, despite producing more accurate and valid results. The study was conducted on a dataset covering Karnataka state, India, with eight crop parameters taken into account, namely the minimum amounts of fertilizers required, such as nitrogen, phosphorus, and potassium, and pH values. The research also considers rainfall, season, soil type, and temperature parameters to provide precise cultivation recommendations for high productivity. The presented algorithm first converts discrete numerals to factors and then reduces levels. Second, the algorithm generates six datasets: two from Case-1 (a dataset with many numeric variables), two from Case-2 (a dataset with many categorical variables), and one from Case-3 (a dataset with reduced factor variables). Finally, the algorithm outputs a class membership allocation based on an extended version of the K-means partitioning method with lambda estimation. The presented work produces mixed-type datasets with precisely categorized crops by organizing data based on environmental conditions, soil nutrients, and geo-location. Finally, the prepared dataset addresses the classification problem, leading to a model evaluation that selects the best dataset for precise crop prediction.
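As a hedged sketch of clustering mixed numeric and categorical agronomic data in the spirit of "K-means with lambda estimation": the third-party `kmodes` package implements Huang's k-prototypes, whose `gamma` weight between numeric and categorical distance is estimated from the data when left unset. The columns and values below are invented, and this is one plausible stand-in rather than the paper's algorithm.

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Columns: N, P, K (kg/ha), pH, soil type, season (categorical last two)
X = np.array([
    [90, 42, 48, 6.5, "red",   "kharif"],
    [30, 18, 20, 7.1, "black", "rabi"],
    [85, 40, 45, 6.4, "red",   "kharif"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kp.fit_predict(X, categorical=[4, 5])  # indices of categorical columns
```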
To better manipulate heterogeneous and distributed data in the data grid, a dataspace management framework for grid data is proposed, based on in-depth research on grid technology. Combining dataspace management technologies, such as the data model iDM and the query language iTrails, with the grid data access middleware OGSA-DAI, a grid dataspace management prototype system is built, in which tasks such as data access, abstraction, indexing, service management, and query answering are implemented as OGSA-DAI workflows. Experimental results show that it is feasible to apply a dataspace management mechanism to the grid environment. Dataspace meets grid data management needs in that it hides the heterogeneity and distribution of grid data and can adapt to the dynamic characteristics of the grid. The proposed grid dataspace management provides a new method for grid data management.
Version 4 (v4) of the Extended Reconstructed Sea Surface Temperature (ERSST) dataset is compared with its predecessor, the widely used version 3b (v3b). The essential upgrades applied in v4 lead to remarkable differences in the characteristics of the sea surface temperature (SST) anomaly (SSTa) in both the temporal and spatial domains. First, the largest discrepancy in global mean SSTa values, around the 1940s, is due to ship-observation corrections made to reconcile observations from buckets and engine-intake thermometers. Second, differences in global and regional mean SSTa values between v4 and v3b exhibit a downward trend (around -0.032°C per decade) before the 1940s, an upward trend (around 0.014°C per decade) during 1950–2015, and an interdecadal oscillation with one peak around the 1980s and two troughs during the 1960s and 2000s, respectively. This does not derive from treatments of the polar or other data-void regions, since the SSTa difference does not share those features. Third, the spatial pattern of the ENSO-related variability in v4 exhibits a wider but weaker cold tongue in the tropical Pacific compared with that of v3b, which could be attributed to differences in gap-filling assumptions, since the latter relies on satellite observations whereas the former relies on in situ ones. This intercomparison confirms that the structural uncertainty arising from underlying assumptions in the treatment of diverse SST observations, even within the same SST product family, is the main source of significant SST differences in the temporal domain. Why this uncertainty introduces artificial decadal oscillations remains unknown.
The rapid growth of modern mobile devices produces a large amount of distributed data, which is extremely valuable for learning models. Unfortunately, training a model by collecting all these original data on a centralized cloud server is not practical, owing to data privacy and communication cost concerns, hindering artificial intelligence from empowering mobile devices. Moreover, these data are not identically and independently distributed (Non-IID) because of their different contexts, which deteriorates model performance. To address these issues, we propose a novel distributed learning algorithm based on hierarchical clustering and Adaptive Dataset Condensation, named ADC-DL, which learns a shared model by collecting the synthetic samples generated on each device. To tackle the heterogeneity of data distributions, we propose an entropy-TOPSIS comprehensive tiering model for hierarchical clustering, which distinguishes clients in terms of their data characteristics. Subsequently, synthetic dummy samples are generated based on the hierarchical structure using adaptive dataset condensation, and the condensation procedure is adjusted adaptively according to the tier of the client. Extensive experiments demonstrate that ADC-DL outperforms existing algorithms in prediction accuracy and communication costs.
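A highly simplified, hedged sketch of the gradient-matching idea underlying dataset condensation: learn a handful of synthetic samples whose training gradient mimics that of the real local data. The toy model, shapes, and cosine loss are illustrative; the paper's adaptive, tier-dependent procedure is not shown.

```python
import torch
import torch.nn.functional as F

net = torch.nn.Linear(20, 5)                       # toy client model
real_x, real_y = torch.randn(64, 20), torch.randint(0, 5, (64,))
syn_x = torch.randn(10, 20, requires_grad=True)    # learnable synthetic samples
syn_y = torch.randint(0, 5, (10,))
opt = torch.optim.Adam([syn_x], lr=0.1)

def grads(x, y):
    """Gradient of the training loss w.r.t. the model parameters."""
    loss = F.cross_entropy(net(x), y)
    return torch.autograd.grad(loss, net.parameters(), create_graph=True)

for _ in range(100):
    g_real = [g.detach() for g in grads(real_x, real_y)]
    g_syn = grads(syn_x, syn_y)
    # Push synthetic-data gradients toward the real-data gradients.
    loss = sum(1 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
               for a, b in zip(g_syn, g_real))
    opt.zero_grad(); loss.backward(); opt.step()
```

Only the small set of synthetic samples (not raw data) would then be shared with the server, which is the privacy and communication argument the abstract makes.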
A M_(S)6.4 earthquake occurred on 21 May 2021 in Yangbi county, Dali prefecture, Yunnan, China, at 21:48 Beijing Time (13:48 UTC). Earthquakes of M3.0 or higher occurred before and after the main shock. Seismic data analysis is essential for in-depth investigation of the 2021 Yangbi M_(S)6.4 earthquake sequence and the seismotectonics of northwestern Yunnan. The Institute of Geophysics, China Earthquake Administration (CEA), has compiled a dataset of seismological observations from 157 broadband stations located within 500 km of the epicenter and has made it available to the earthquake science research community. The dataset (total file size: 329 GB) consists of event waveforms with a sampling frequency of 100 sps collected from 18 to 28 May 2021, 20-Hz and 100-Hz continuous waveforms collected from 12 to 31 May 2021, and seismic instrument response files. To promote data sharing, the dataset also includes seismic event waveforms from 20 to 22 May 2021 recorded at 50 stations of the ongoing Binchuan Active Source Geophysical Observation Project, for which the data protection period has not expired. Sample waveforms of the main shock are included in the appendix of this article and can be downloaded from the Earthquake Science website. The event and continuous waveforms are available on application from the Earthquake Science Data Center website (www.esdc.ac.cn).
Precise indoor positioning information provides a foundation for position-related customer services. Despite the emergence of several indoor positioning technologies, such as ultra-wideband, infrared, radio frequency identification, Bluetooth beacons, pedestrian dead reckoning, and magnetic field, Wi-Fi is one of the most widely used. Wi-Fi fingerprinting is by far the most popular method and has been researched over the past two decades. Wi-Fi positioning faces three core problems: device heterogeneity, robustness to signal changes caused by human mobility, and device attitude, i.e., varying orientations. Existing methods do not cover these aspects, owing to the unavailability of public datasets. This study introduces a dataset of Wi-Fi received signal strength (RSS) gathered using several different devices (Samsung Galaxy S8, S9, and A8, and LG G6 and G7), operated by three surveyors, one female and two male. In addition, three smartphone orientations are used for the data collection, which covers multiple buildings in a multi-floor environment. Various levels of human mobility have been considered in dynamic environments. To analyze time-related effects on Wi-Fi RSS, data spanning 3 years have been considered.
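A hedged sketch of the classic fingerprinting baseline such a dataset supports: k-nearest-neighbors regression from an RSS vector (one value per access point) to x/y coordinates. The access points, coordinates, and k are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Offline phase: RSS fingerprints (dBm, one column per AP) at known positions (m).
rss_train = np.array([[-45, -60, -70], [-50, -55, -80], [-70, -62, -48]])
xy_train = np.array([[1.0, 2.0], [2.5, 2.0], [6.0, 5.5]])

# Online phase: estimate a position by averaging the k closest fingerprints.
knn = KNeighborsRegressor(n_neighbors=2).fit(rss_train, xy_train)
print(knn.predict([[-47, -58, -72]]))   # estimated (x, y)
```

Device heterogeneity, mobility, and orientation (the three problems named above) all perturb the RSS vector, which is why a multi-device, multi-surveyor, multi-orientation dataset is needed to evaluate such methods fairly.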
Distributed Denial of Service (DDoS) attacks are performed by multiple agents against a single victim. Essentially, all attacking agents send many packets toward the victim to overwhelm it with requests, thereby overloading its resources. Since conducting a real DDoS attack is complex and expensive, most organizations and researchers resort to simulations to mimic an actual attack. Researchers have devised diverse algorithms and mechanisms for attack detection and prevention, and simulation is good practice for determining the efficacy of an intrusion detection measure against DDoS attacks. However, some mechanisms are ineffective and thus not applied to real-life attacks. Nowadays, DDoS attacks have become too complex and modern for most IDSs to detect, so adjustable and configurable traffic generators are becoming more and more important. This paper first details the datasets that scholars use for DDoS attack detection, then describes the free and commercial tools available for simulating DDoS attacks. In addition, a traffic generator for normal traffic and different types of DDoS attack traffic has been developed. The aim of the paper is to simulate a cloud environment with the OMNeT++ simulation tool under different DDoS attack types. Generating normal and attack traffic can be useful for evaluating IDSs under development for DDoS attack detection. Moreover, the resulting traffic can be used to test the effectiveness of algorithms, techniques, and procedures against DDoS attacks.
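A hedged toy illustration of the difference between normal and flood traffic at the arrival-process level: Poisson arrivals for background load versus a much higher aggregate rate from many agents. The rates and agent count are arbitrary choices, and this is not the paper's OMNeT++ model.

```python
import random

def arrivals(rate_pps: float, duration_s: float):
    """Exponential inter-arrival times -> list of packet timestamps."""
    t, out = 0.0, []
    while t < duration_s:
        t += random.expovariate(rate_pps)
        out.append(t)
    return out

normal = arrivals(rate_pps=50, duration_s=10)        # background clients
attack = [ts for _ in range(100)                     # 100 attacking agents
          for ts in arrivals(rate_pps=200, duration_s=10)]
print(len(normal), len(attack))  # the flood dwarfs the background load
```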
The three-member historical simulations by the Chinese Academy of Sciences Flexible Global Ocean-Atmosphere-Land System model, version f3-L (CAS FGOALS-f3-L), which is contributing to phase 6 of the Coupled Model Intercomparison Project (CMIP6), are described in this study. The details of the CAS FGOALS-f3-L model, the experiment settings, and the output datasets are briefly introduced. The datasets include monthly and daily outputs from the atmospheric, oceanic, land, and sea-ice component models of CAS FGOALS-f3-L, and all these data have been published online in the Earth System Grid Federation (ESGF, https://esgf-node.llnl.gov/projects/cmip6/). The three ensemble members are initialized from the 600th, 650th, and 700th model years of the preindustrial experiment (piControl) and forced by the same historical forcing provided by CMIP6 from 1850 to 2014. The performance of the coupled model is validated against recent observed atmospheric and oceanic datasets. It is shown that CAS FGOALS-f3-L reproduces the main features of the modern climate, including the climatology of surface air temperature and precipitation, the long-term changes in global mean surface air temperature, ocean heat content, and sea surface steric height, and the horizontal and vertical distributions of temperature in the ocean and atmosphere. Meanwhile, like other state-of-the-art coupled GCMs, some obvious biases remain in the historical simulations, which are also illustrated. This paper can help users better understand the advantages and biases of the model and the datasets.