This paper presents a methodology to determine three data quality (DQ) risk characteristics: accuracy, comprehensiveness and nonmembership. The methodology provides a set of quantitative models to confirm the information quality risks for the database of a geographical information system (GIS). Four quantitative measures are introduced to examine how the quality risks of source information affect the quality of information outputs produced using the relational algebra operations Selection, Projection, and Cartesian Product. The methodology can be used to determine how quality risks associated with diverse data sources affect the derived data. In the construction business, the GIS is the prime source of information on the location of cables, and detection time strongly depends on whether maps indicate the presence of cables. Poor data quality in the GIS can contribute to increased risk or higher risk avoidance costs. A case study provides a numerical example of the calculation of the trade-offs between risk and detection costs, and of the costs of data quality. We conclude that the model contributes valuable new insight.
Since the British National Archives put forward the concept of digital continuity in 2007, several developed countries have worked out digital continuity action plans. However, technologies for guaranteeing digital continuity are still lacking. First, this paper analyzes the requirements of a digital continuity guarantee for electronic records based on data quality theory, and points out the necessity of a data quality guarantee for electronic records. Moreover, we reduce the digital continuity guarantee for electronic records to ensuring their consistency, completeness and timeliness, and construct the first technology framework for this guarantee. Finally, temporal functional dependencies are used to build the first integrated method for ensuring the consistency, completeness and timeliness of electronic records.
One of the goals of data collection is to prepare for decision-making, so high quality requirements must be satisfied. Rational evaluation of data quality is an effective way to identify data problems in time, and data that pass this evaluation satisfy the requirements of the decision maker. A fuzzy neural network based method of data quality evaluation is proposed. First, the criteria for the evaluation of data quality are selected to construct the fuzzy sets of evaluation grades; then, using the learning ability of the neural network (NN), an objective evaluation of membership is carried out, which can be used for the effective evaluation of data quality. This research has been applied in the platform of the 'data report of national compulsory education outlay guarantee' from the Chinese Ministry of Education. The method can be used for the effective evaluation of data quality elsewhere, and it allows the data quality situation to be assessed more completely, more objectively, and in a more timely manner.
Most GIS databases contain data errors. The quality of the data sources, such as traditional paper maps or more recent remote sensing data, determines spatial data quality. In the past several decades, different statistical measures have been developed to evaluate data quality for different types of data, such as nominal categorical data, ordinal categorical data and numerical data. Although these methods were originally proposed for medical or psychological research, they have been widely used to evaluate spatial data quality. In this paper, we first review statistical methods for evaluating data quality, discuss under what conditions they should be used and how to interpret the results, and then briefly discuss statistical software and packages that can be used to compute these data quality measures.
In contrast with research on new models, little attention has been paid to the impact of low- or high-quality data fed to a dialogue system. The present paper makes the first attempt to fill this gap by extending our previous work on question-answering (QA) systems, investigating the effect of misspelling on QA agents and how context changes can enhance the responses. Instead of using large language models trained on huge datasets, we propose a method that improves the model's score by modifying only the quality and structure of the data fed to the model. It is important to identify the features that affect agent performance, because a high rate of wrong answers can make students lose interest in using the QA agent as an additional tool for distance learning. The results demonstrate that the accuracy of the proposed context simplification exceeds 85%. These findings shed light on the importance of question data quality and context complexity as key dimensions of the QA system. In conclusion, the experimental results on questions and contexts show that controlling and improving the various aspects of data quality around the QA system can significantly enhance its robustness and performance.
Several organizations have migrated to the cloud for better quality in business engagements and security. Data quality is crucial in present-day activities. Information is generated and collected from data representing real-time facts and activities. Poor data quality affects organizational decision-making and customer satisfaction, and negatively influences the organization’s execution. Data quality also has a massive influence on the accuracy, complexity and efficiency of machine learning and deep learning tasks. There are several methods and tools to evaluate data quality to ensure smooth incorporation in model development. Most data quality tools permit the assessment of data sources only at a certain point in time, so scheduling and automation are left to the user. Ensuring automatic data quality involves several steps: gathering data from different sources, monitoring data quality, and adequately addressing any problems found. There was a gap in the literature, as no previous attempt had been made to collate the advances in the different dimensions of automatic data quality. This limited narrative review of existing literature sought to address this gap by correlating the different steps and advancements related to automatic data quality systems. The six crucial data quality dimensions in organizations were discussed, and big data were compared and classified. This review highlights existing data quality models and strategies that can contribute to the development of automatic data quality systems.
Today, the quantity of data continues to increase. Furthermore, the data are heterogeneous, come from multiple sources (structured, semi-structured and unstructured) and have different levels of quality. It is therefore very likely that data are manipulated without knowledge of their structure and semantics. In fact, the metadata may be insufficient or totally absent. Data anomalies may be due to the poverty of their semantic descriptions, or even the absence of any description. In this paper, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct intra-column and inter-column anomalies. We aim to improve the quality of data by processing null values and the semantic dependencies between columns.
Background: High data quality provides correct and up-to-date information, which is critical not only for the maintenance of health care at an optimal level, but also for the provision of high-quality clinical care, continuing health care, clinical and health service research, and planning and management of health systems. Good data is core to achievable improvements in the health sector. Aim/Objective: To assess the level of knowledge and practices of Community Health Nurses on data quality in the Ho municipality, Ghana. Methods: A descriptive cross-sectional study was employed, using a standard Likert scale questionnaire. A census was used to collect information from 77 Community Health Nurses. The data were entered in Epi-Data 3.1 and exported to STATA 12.0 for analysis. Chi-square and logistic analyses were performed to establish associations between categorical variables, and a p-value of less than 0.05 (95% confidence level) was considered statistically significant. Results: Of the 77 Community Health Nurses studied, 49 (63.64%) had good knowledge on data accuracy, 51 (66.23%) had poor knowledge on data completeness, and 64 (83.12%) had poor knowledge on data timeliness. Also, 16 (20.78%) and 33 (42.86%) of the 77 Community Health Nurses responded that there was no designated staff for data quality review and no feedback from the health directorate, respectively. Of the 16 health facilities studied for data quality practices, half (8, 50.00%) had missing values on copies of their previous months’ report forms. Moreover, 10 (62.50%) had no reminders (monthly data submission itineraries) at the facility level. Conclusion: Overall, the general level of knowledge of Community Health Nurses on data quality was poor, and their practices for improving data quality at the facility level were woefully inadequate. Therefore, Community Health Nurses need on-the-job training and proper education on data quality and its dimensions. Also, the health directorate should intensify its continuous supportive supervisory visits at all facilities, and feedback should be given to the Community Health Nurses on the data submitted.
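The percentages in the Results paragraph above can be re-derived from the reported counts. The following is a minimal sketch of that arithmetic only; the helper name `percent` and the dictionary labels are illustrative assumptions, not part of the study.

```python
def percent(count, total):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)

# (count, total) pairs copied from the Results paragraph; the labels are
# shorthand chosen for this sketch.
counts = {
    "good knowledge of accuracy": (49, 77),
    "poor knowledge of completeness": (51, 77),
    "poor knowledge of timeliness": (64, 77),
    "no designated review staff": (16, 77),
    "no feedback from directorate": (33, 77),
    "facilities with missing values": (8, 16),
    "facilities without reminders": (10, 16),
}

shares = {label: percent(c, t) for label, (c, t) in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2f}%")
```

Each computed share agrees with the figure quoted in the abstract, which suggests the reported counts and percentages are internally consistent.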
By researching the data quality problem in the monitoring and diagnosis system (MDS), a method of detecting non-condition data based on the development trend of equipment condition is proposed, and three requirements on the criteria for detecting non-condition data are discussed: they should be dynamic, synthetic and simple. According to the general mode of data management in MDS, a data quality assurance system (DQAS) comprising data quality monitoring, data quality diagnosis, detection criteria adjustment and manual confirmation is set up. A route inspection system called MTREE implements the DQAS. For the vibration data of route inspection, two detection criteria are defined. One is the quality monitoring parameter, which is found by combining and optimizing some fundamental parameters using genetic programming (GP). The other is the quality diagnosis criterion, i.e. a pseudo distance of the Spectral-Energy-Vector (SEV) named Adjacent J-divergence, which indicates the degree of variation in the spectral energy distribution of adjacent data. Results show that DQAS, including these two criteria, is effective in improving the data quality of MDS.
The in-orbit commissioning of the ZY-1 02C satellite is proceeding smoothly. According to the relevant experts in this field, the imagery quality of the satellite has reached or nearly reached the level of international satellites of the same kind. The ZY-1 02C satellite and the ZY-3 satellite were successfully launched on December 22, 2011 and January 9, 2012 respectively. China Centre for Resources Satellite Data and Application (CRSDA) was responsible for the building of a ground
Nowadays, data are increasingly used for intelligent modeling and prediction, and the comprehensive evaluation of data quality is receiving more and more attention as a necessary means to measure whether the data are usable. However, comprehensive evaluation methods for data quality mostly contain subjective factors of the evaluator, so how to evaluate the data comprehensively and objectively has become a bottleneck in research on such methods. In order to evaluate the data more comprehensively, objectively and with better differentiation, a novel comprehensive evaluation method based on particle swarm optimization (PSO) and grey correlation analysis (GCA) is presented in this paper. First, an improved GCA evaluation model based on the technique for order preference by similarity to an ideal solution (TOPSIS) is proposed. Then, an objective function model that maximizes the difference of the comprehensive evaluation values is built, and the PSO algorithm is used to optimize the weights of the improved GCA evaluation model based on this objective function. Finally, the performance of the proposed method is investigated through parameter analysis. A performance comparison on traffic flow data is carried out, and the simulation results show that the maximum average difference between the evaluation results and their mean value (MDR) of the proposed method is 33.24% higher than that of TOPSIS-GCA and 6.86% higher than that of GCA. The proposed method differentiates better than the other methods, which means that it objectively and comprehensively evaluates the data in terms of both relevance and differentiation, and its results reflect the differences in data quality more effectively, which will provide more effective data support for intelligent modeling, prediction and other applications.
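As a rough illustration of the grey correlation analysis (GCA) step mentioned above, the sketch below computes textbook grey relational grades against a reference sequence. It is not the paper's improved TOPSIS-based model, and the PSO weight optimization is omitted; the function name, the equal default weights and the distinguishing coefficient `rho=0.5` are assumptions made for this example.

```python
def grey_relational_grades(reference, candidates, rho=0.5, weights=None):
    """Textbook grey relational grade of each candidate against `reference`.

    Inputs are assumed to be already normalized to comparable scales.
    `weights` stands in for the indicator weights that the paper tunes with
    PSO; equal weights are used here when none are given.
    """
    n = len(reference)
    if weights is None:
        weights = [1.0 / n] * n

    # Absolute differences between each candidate and the reference.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate coincides with the reference.
        return [1.0 for _ in candidates]

    grades = []
    for row in deltas:
        # Grey relational coefficient for each indicator.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(w * c for w, c in zip(weights, coeffs)))
    return grades


# Hypothetical normalized quality indicators for two data sets.
ideal = [1.0, 1.0, 1.0]
grades = grey_relational_grades(ideal, [[1.0, 1.0, 1.0], [0.5, 0.8, 0.2]])
print(grades)
```

A higher grade means the candidate's indicators lie closer to the reference sequence; differentiation between candidates then depends on how the weights are chosen, which is the part the paper addresses with PSO.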
Timestamps play a key role in process mining because they determine the chronology in which events occurred and, subsequently, how the events are ordered in process modelling. The timestamp gives insight into process performance, conformance, and modelling. Problems with timestamps therefore result in misrepresentations of the mined process. A few articles have been published on the quantification of data quality problems, but at the time of writing only one of them addresses the quantification of timestamp quality problems. This article evaluates the quality of timestamps in an event log across two axes, using eleven quality dimensions and four levels of potential data quality problems. The eleven data quality dimensions were obtained through a thorough literature review of more than fifty process mining articles that focus on quality dimensions. This evaluation resulted in twelve data quality quantification metrics, which were applied to the MIMIC-II dataset as an illustration. The outcome of the timestamp quality quantification using the proposed typology enables the user to appreciate the quality of the event log, and thus makes it possible to evaluate the risk of carrying out specific data cleaning measures to improve the process mining outcome.
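To make the idea of quantifying timestamp quality concrete, the sketch below computes two simple indicators on an event log: the share of events with a missing timestamp and the share recorded out of chronological order within their case. These two indicators and the function name `timestamp_quality` are hypothetical illustrations, not the twelve metrics defined in the article.

```python
from datetime import datetime

def timestamp_quality(events):
    """Two illustrative timestamp checks on an event log.

    `events` is a list of (case_id, activity, timestamp) tuples in recorded
    order, where `timestamp` is a datetime or None if it is missing.
    """
    total = len(events)
    missing = sum(1 for _, _, ts in events if ts is None)

    latest_per_case = {}
    out_of_order = 0
    for case_id, _, ts in events:
        if ts is None:
            continue
        prev = latest_per_case.get(case_id)
        if prev is not None and ts < prev:
            # Event is recorded after a later-stamped event of the same case.
            out_of_order += 1
        latest_per_case[case_id] = ts if prev is None else max(prev, ts)

    return {
        "missing_ratio": missing / total if total else 0.0,
        "out_of_order_ratio": out_of_order / total if total else 0.0,
    }


# Hypothetical four-event log with one missing and one misordered timestamp.
log = [
    ("c1", "admit", datetime(2020, 1, 1, 10)),
    ("c1", "triage", datetime(2020, 1, 1, 9)),
    ("c1", "discharge", None),
    ("c2", "admit", datetime(2020, 1, 2)),
]
report = timestamp_quality(log)
print(report)
```

Ratios of this kind give a first impression of how risky it would be to reorder or impute timestamps before mining the process.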
Real world study (RWS) has become a hotspot for clinical research. Data quality plays a vital role in research achievement and other clinical research fields. In this paper, the common quality problems in the RWS of traditional Chinese medicine are discussed, and a countermeasure is proposed.
A C-band mobile polarimetric radar with simultaneous horizontal and vertical transmission was built in the State Key Laboratory of Severe Weather, Chinese Academy of Meteorological Sciences. It was used in heavy rainfall and typhoon observations in 2008. It is well known that radar calibration is essential to high-quality radar data and products. In this paper, test and weather signals were used to calibrate reflectivity ZH, differential reflectivity ZDR and differential phase ΦDP. Noise effects on the correlation coefficient ρHV at low signal-to-noise ratio (SNR) were analyzed. The polarimetric radar data for a heavy rain event and a snow event were inspected to evaluate the performance of the calibration method and the radar data quality, and S-band Doppler radar data were used to validate the quality of the reflectivity data collected by the polarimetric radar. The results show that the polarimetric and S-band Doppler radars observed comparable reflectivity values and a similar structure for a heavy rainfall case at middle and low levels. The mismatch of the two receivers produces obvious ZDR biases, which were verified by radar data observed at vertical incidence. The ZDR correction improved the radar data quality. The usable range for ρHV was defined. Application of the calibration method introduced in this paper can reduce the system biases caused by the difference between the horizontal (H) and vertical (V) channels. After calibration and correction, the polarimetric parameters observed by the polarimetric radar can be used in further research.
Synchrophasor systems, which provide low-latency, high-precision, and time-synchronized measurements to enhance power grid performance, are deployed globally. However, the synchrophasor system, as a physical network, involves communication constraints and data quality issues, which can impact or even disable certain synchrophasor applications. This work investigates the data quality issue for synchrophasor applications. In Part I, the standards of synchrophasor systems and the classifications and data quality requirements of synchrophasor applications are reviewed. Also, actual events involving synchronization signal accuracy, synchrophasor data loss, and latency are counted and analyzed. The review and statistics are expected to provide an overall picture of data accuracy, loss, and latency issues for synchrophasor applications.
Data quality management, especially data cleansing, has been extensively studied for many years in the areas of data management and visual analytics. In this paper, we first review and explore the relevant work from the research areas of data management, visual analytics and human-computer interaction. Then, for different types of data such as multimedia data, textual data, trajectory data, and graph data, we summarize the common methods for improving data quality by leveraging data cleansing techniques at different analysis stages. Based on a thorough analysis, we propose a general visual analytics framework for interactively cleansing data. Finally, the challenges and opportunities are analyzed and discussed in the context of data and humans.
The real-time energy flow data obtained in industrial production processes are usually of low quality. It is difficult to accurately predict the short-term energy flow profile using these field data, which diminishes the effect of industrial big data and artificial intelligence in industrial energy systems. The real-time data of blast furnace gas (BFG) generation collected at iron and steel sites are also of low quality. To tackle this problem, a three-stage data quality improvement strategy was proposed for predicting BFG generation. In the first stage, the correlation principle was used to test the sample set. In the second stage, the original sample set was rectified and updated. In the third stage, a Kalman filter was employed to eliminate the noise of the updated sample set. The method was verified with an autoregressive integrated moving average model, a back propagation neural network model and a long short-term memory model. The results show that the prediction models based on the proposed three-stage data quality improvement method perform well. The long short-term memory model has the best prediction performance, with a mean absolute error of 17.85 m³/min, a mean absolute percentage error of 0.21%, and an R squared of 95.17%.
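The third stage (Kalman filtering) and the two error measures quoted above can be sketched as follows. This is a minimal scalar Kalman filter under a random-walk assumption, not the authors' configuration; the function names and the variance values `process_var` and `meas_var` are assumptions chosen for illustration.

```python
def kalman_smooth(measurements, process_var=1e-3, meas_var=1.0):
    """Scalar Kalman filter assuming the true signal follows a random walk.

    `process_var` and `meas_var` are assumed noise variances; in practice
    they would be tuned to the sensor and the process.
    """
    estimate = measurements[0]
    error_var = 1.0
    filtered = []
    for z in measurements:
        # Predict: the state is carried over, uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + meas_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered


def mean_absolute_error(actual, predicted):
    """Average absolute deviation, in the unit of the data (e.g. m³/min)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute deviation relative to the actual value, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical noisy readings oscillating around a level of 5.
noisy = [4.0, 6.0, 4.0, 6.0, 4.0, 6.0]
smoothed = kalman_smooth(noisy)
print(smoothed)
```

With a small `process_var` the filter trusts its running estimate and damps measurement noise strongly; a larger value lets the estimate follow real changes in gas generation more quickly.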
Wired drill pipe (WDP) technology is one of the most promising data acquisition technologies in today's oil and gas industry. For the first time, it allows sensors to be positioned along the drill string, which enables valuable data to be collected and transmitted not only from the bottom hole assembly (BHA), but also along the entire length of the wellbore to the drill floor. The technology has received industry acceptance as a viable alternative to the typical logging while drilling (LWD) method. Recently, more and more WDP applications can be found in challenging drilling environments around the world, leading to many innovations in the industry. Nevertheless, most of the data acquired from WDP can be noisy and, in some circumstances, of very poor quality. Diverse factors contribute to the poor data quality. The most common sources include mis-calibrated sensors, sensor drift, errors during data transmission, and abnormal conditions in the well. The challenge of improving the data quality has attracted increasing attention from researchers during the past decade. This paper proposes a solution to this challenge by correcting the raw WDP data and estimating unmeasurable parameters to reveal downhole behaviors. An advanced data processing method, data validation and reconciliation (DVR), is employed, which makes use of the redundant data from multiple WDP sensors to filter the noise from the measurements and ensures the coherence of all sensors and models. Moreover, it can distinguish accurate measurements from inaccurate ones. In addition, the data with improved quality can be used for estimating some crucial parameters in the drilling process that cannot be measured directly, and hence provide better model calibrations for integrated well planning and real-time operations.
Nowadays, several research projects show interest in employing volunteered geographic information (VGI) to improve their systems by using up-to-date and detailed data. The European project CAP4Access is one of the successful examples of such international research projects; it aims to improve the accessibility of people with restricted mobility using crowdsourced data. In this project, OpenStreetMap (OSM) is used to extend OpenRouteService, a well-known routing platform. However, a basic challenge that this project tackled was the incompleteness of OSM data with regard to certain information that is required for wheelchair accessibility (e.g. sidewalk information, kerb data, etc.). In this article, we present the results of the initial assessment of sidewalk data in OSM at the beginning of the project, as well as our approach to raising awareness and using tools for tagging accessibility data in the OSM database to improve sidewalk data completeness. Several experiments have been carried out in different European cities, and a discussion of the results of the experiments as well as the lessons learned is provided. The lessons learned provide recommendations that help in organizing better mapping party events in the future. We conclude by reporting on how and to what extent OSM sidewalk data completeness in these study areas had benefited from the mapping parties by the end of the project.
Virtual globes (VGs) allow Internet users to view geographic data of heterogeneous quality created by other users. This article presents a new approach for collecting and visualizing information about the perceived quality of 3D data in VGs. It aims at improving users’ awareness of the quality of 3D objects. Instead of relying on the existing metadata or on formal accuracy assessments that are often impossible in practice, we propose a crowd-sourced quality recommender system based on the five-star visualization method successful in other types of Web applications. Four alternative five-star visualizations were implemented in a Google Earth-based prototype and tested through a formal user evaluation. These tests helped identify the most effective method for a 3D environment. Results indicate that while most websites use a visualization approach that shows a ‘number of stars’, this method was the least preferred by participants. Instead, participants ranked the ‘number within a star’ method highest, as it reduced the visual clutter in urban settings, suggesting that 3D environments such as VGs require different design approaches than 2D or non-geographic applications. Results also confirmed that expert and non-expert users in geographic data share similar preferences for the most and least preferred visualization methods.
Funding (GIS data quality risk paper): The National Natural Science Foundation of China (No. 70772021, 70372004); China Postdoctoral Science Foundation (No. 20060400077).
Funding (digital continuity paper): This work is supported by the NSFC (Nos. 61772280, 61772454), the Changzhou Sci&Tech Program (No. CJ20179027), and the PAPD fund from NUIST. Prof. Jin Wang is the corresponding author.
Funding (fuzzy neural network evaluation paper): The National Natural Science Foundation of China (60503024, 50634010).
文摘<span style="font-family:Verdana;">Most GIS databases contain data errors. The quality of the data sources such as traditional paper maps or more recent remote sensing data determines spatial data quality. In the past several decades, different statistical measures have been developed to evaluate data quality for different types of data, such as nominal categorical data, ordinal categorical data and numerical data. Although these methods were originally proposed for medical research or psychological research, they have been widely used to evaluate spatial data quality. In this paper, we first review statistical methods for evaluating data quality, discuss under what conditions we should use them and how to interpret the results, followed by a brief discussion of statistical software and packages that can be used to compute these data quality measures.</span>
Abstract: In contrast with research on new models, little attention has been paid to the impact of low- or high-quality data fed to a dialogue system. The present paper makes the first attempt to fill this gap by extending our previous work on question-answering (QA) systems, investigating the effect of misspelling on QA agents and how context changes can enhance the responses. Instead of using large language models trained on huge datasets, we propose a method that enhances the model's score by modifying only the quality and structure of the data fed to the model. It is important to identify the features that modify the agent's performance, because a high rate of wrong answers can make students lose interest in using the QA agent as an additional tool for distance learning. The results demonstrate that the accuracy of the proposed context simplification exceeds 85%. These findings shed light on the importance of question data quality and context complexity as key dimensions of the QA system. In conclusion, the experimental results on questions and contexts showed that controlling and improving the various aspects of data quality around the QA system can significantly enhance its robustness and performance.
Abstract: Several organizations have migrated to the cloud for better quality in business engagements and security. Data quality is crucial in present-day activities. Information is generated and collected from data representing real-time facts and activities. Poor data quality affects organizational decision-making and customer satisfaction, and negatively influences the organization's scheme of execution. Data quality also has a massive influence on the accuracy, complexity and efficiency of machine learning and deep learning tasks. There are several methods and tools to evaluate data quality to ensure smooth incorporation in model development. Most data quality tools permit the assessment of data sources only at a certain point in time, so the arrangement and automation are left to the user. Ensuring automatic data quality involves several steps: gathering data from different sources, monitoring data quality, and adequately addressing any problems with it. There was a gap in the literature, as no attempts had previously been made to collate all the advances in the different dimensions of automatic data quality. This limited narrative review of the existing literature sought to address this gap by correlating the different steps and advancements related to automatic data quality systems. The six crucial data quality dimensions in organizations were discussed, and big data were compared and classified. This review highlights existing data quality models and strategies that can contribute to the development of automatic data quality systems.
Abstract: Today, the quantity of data continues to increase. Furthermore, the data are heterogeneous, come from multiple sources (structured, semi-structured and unstructured) and have different levels of quality. It is therefore very likely that data are manipulated without knowledge of their structure and semantics; the metadata may be insufficient or totally absent. Data anomalies may be due to poor semantic descriptions, or even to the absence of any description. In this paper, we propose an approach to better understand the semantics and the structure of the data. Our approach helps to automatically correct both intra-column and inter-column anomalies. We aim to improve the quality of data by processing the null values and the semantic dependencies between columns.
Abstract: Background: High data quality provides correct and up-to-date information, which is critical not only for maintaining health care at an optimal level, but also for the provision of high-quality clinical care, continuing health care, clinical and health service research, and the planning and management of health systems. Good data are core to achieving improvements in the health sector. Aim/Objective: To assess the level of knowledge and practices of Community Health Nurses on data quality in the Ho municipality, Ghana. Methods: A descriptive cross-sectional study was conducted using a standard Likert scale questionnaire. A census was used to collect information from 77 Community Health Nurses. The data were entered with the statistical software Epi-Data 3.1 and exported to STATA 12.0 for analysis. Chi-square and logistic analyses were performed to establish associations between categorical variables, and a p-value of less than 0.05 (95% confidence level) was considered statistically significant. Results: Of the 77 Community Health Nurses studied, 49 (63.64%) had good knowledge of data accuracy, 51 (66.23%) had poor knowledge of data completeness, and 64 (83.12%) had poor knowledge of data timeliness. Also, 16 (20.78%) and 33 (42.86%) of the 77 Community Health Nurses responded that there was no designated staff for data quality review and no feedback from the health directorate, respectively. Of the 16 health facilities studied for data quality practices, half (8, 50.00%) had missing values on copies of their previous months' report forms. Moreover, 10 (62.50%) had no reminders (monthly data submission itineraries) at the facility level. Conclusion: Overall, the general level of knowledge of Community Health Nurses on data quality was poor, and their practices for improving data quality at the facility level were woefully inadequate. Therefore, Community Health Nurses need to be given on-the-job training and proper education on data quality and its dimensions. Also, the health directorate should intensify its continuous supportive supervisory visits to all facilities, and feedback should be given to the Community Health Nurses on the data submitted.
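The percentages reported in the Results follow directly from the stated counts. A quick arithmetic check, assuming two-decimal rounding (the helper name `pct` is illustrative and not from the study):

```python
def pct(count, total):
    """Percentage of count in total, rounded to two decimals."""
    return round(100.0 * count / total, 2)

# (count, total, percentage stated in the abstract)
reported = [
    (49, 77, 63.64),  # good knowledge of data accuracy
    (51, 77, 66.23),  # poor knowledge of data completeness
    (64, 77, 83.12),  # poor knowledge of data timeliness
    (16, 77, 20.78),  # no designated staff for data quality review
    (33, 77, 42.86),  # no feedback from the health directorate
    (8, 16, 50.00),   # facilities with missing values on report forms
    (10, 16, 62.50),  # facilities without submission reminders
]

for count, total, stated in reported:
    # Each recomputed value should match the figure given in the abstract.
    assert pct(count, total) == stated, (count, total, stated)
```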
Funding: This paper is supported by the National Natural Science Foundation of China under Grant No. 50335030.
Abstract: Based on a study of the data quality problem in monitoring and diagnosis systems (MDS), a method of detecting non-condition data from the development trend of the equipment condition is proposed, and three requirements on the criteria for detecting non-condition data are discussed: they should be dynamic, comprehensive and simple. According to the general mode of data management in MDS, a data quality assurance system (DQAS) comprising data quality monitoring, data quality diagnosis, detection criteria adjustment and manual confirmation is set up. A route inspection system called MTREE implements the DQAS. For the vibration data of route inspection, two detection criteria are defined. One is the quality monitoring parameter, which is found by combining and optimizing some fundamental parameters with genetic programming (GP). The other is the quality diagnosis criterion, i.e. a pseudo-distance between Spectral-Energy-Vectors (SEV) named Adjacent J-divergence, which indicates the degree of variation in the spectral energy distribution of adjacent data. Results show that the DQAS, including these two criteria, is effective in improving the data quality of MDS.
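The abstract does not give the formula for the Adjacent J-divergence. As a minimal sketch, assuming it is the standard symmetric Kullback-Leibler (Jeffreys) divergence applied to consecutive, normalized spectral energy vectors, the idea can be illustrated as follows. The function names and the `eps` floor are assumptions for illustration, not the authors' implementation.

```python
import math

def normalize(energies, eps=1e-12):
    """Scale non-negative band energies so they sum to 1 (floored at eps)."""
    total = sum(energies)
    if total <= 0:
        raise ValueError("spectral energies must sum to a positive value")
    return [max(e / total, eps) for e in energies]

def j_divergence(p, q):
    """Symmetric Kullback-Leibler (Jeffreys) divergence of two distributions."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def adjacent_j_divergence(sev_series):
    """J-divergence between each pair of consecutive spectral energy vectors."""
    dists = [normalize(v) for v in sev_series]
    return [j_divergence(a, b) for a, b in zip(dists, dists[1:])]

# Hypothetical three-band energy vectors from successive inspections;
# the last one has a very different spectral shape.
series = [[4.0, 2.0, 1.0], [4.1, 2.0, 0.9], [1.0, 2.0, 4.0]]
print(adjacent_j_divergence(series))
```

Under this assumption, a small value means that two adjacent measurements have a similar spectral shape regardless of overall amplitude, while a large value flags a sample whose energy distribution changed abruptly and may therefore be non-condition data.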
Abstract: The in-orbit commissioning of the ZY-1 02C satellite is proceeding smoothly. According to experts in the field, the imagery quality of the satellite has reached or nearly reached the level of international satellites of the same kind. The ZY-1 02C and ZY-3 satellites were successfully launched on December 22, 2011 and January 9, 2012, respectively. China Centre for Resources Satellite Data and Application (CRSDA) was responsible for the building of a ground
Funding: the Scientific Research Funding Project of Liaoning Education Department of China under Grant No. JDL2020005 and No. LJKZ0485; the National Key Research and Development Program of China under Grant No. 2018YFA0704605.
Abstract: Data are increasingly used for intelligent modeling and prediction, and the comprehensive evaluation of data quality is attracting growing attention as a necessary means of measuring whether the data are usable. However, comprehensive evaluation methods for data quality mostly involve subjective factors of the evaluator, so how to evaluate data comprehensively and objectively has become a bottleneck in research on comprehensive evaluation methods. To evaluate data more comprehensively, objectively and with better differentiation, a novel comprehensive evaluation method based on particle swarm optimization (PSO) and grey correlation analysis (GCA) is presented in this paper. First, an improved GCA evaluation model based on the technique for order preference by similarity to an ideal solution (TOPSIS) is proposed. Then, an objective function model that maximizes the difference between the comprehensive evaluation values is built, and the PSO algorithm is used to optimize the weights of the improved GCA evaluation model based on this objective function. Finally, the performance of the proposed method is investigated through parameter analysis. A performance comparison on traffic flow data is carried out, and the simulation results show that the maximum average difference between the evaluation results and their mean value (MDR) of the proposed method is 33.24% higher than that of TOPSIS-GCA and 6.86% higher than that of GCA. The proposed method differentiates better than the other methods, which means that it evaluates the data objectively and comprehensively in terms of both relevance and differentiation. Its results reflect differences in data quality more effectively, which will provide more effective data support for intelligent modeling, prediction and other applications.
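For orientation, the textbook form of grey correlation (grey relational) analysis that the method builds on can be sketched as below. This is plain GCA with fixed weights, not the TOPSIS-improved, PSO-weighted model of the paper; the function name, the equal default weights and the distinguishing coefficient `rho=0.5` are assumptions, and indicator values are assumed to be already normalized to a common scale.

```python
def grey_relational_grades(reference, candidates, weights=None, rho=0.5):
    """Weighted grey relational grade of each candidate against a reference.

    reference  -- ideal indicator values (already normalized)
    candidates -- list of indicator vectors, one per data set being evaluated
    weights    -- indicator weights summing to 1 (equal weights if omitted)
    rho        -- distinguishing coefficient, conventionally 0.5
    """
    n = len(reference)
    if weights is None:
        weights = [1.0 / n] * n
    # Absolute deviation of every candidate from the reference, per indicator.
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate coincides with the reference sequence.
        return [1.0] * len(candidates)
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(w * c for w, c in zip(weights, coeffs)))
    return grades

# Hypothetical quality indicators (e.g. completeness, accuracy) for two data sets.
print(grey_relational_grades([1.0, 1.0], [[0.9, 0.8], [0.5, 0.6]]))
```

In the paper's approach the weights would be chosen by PSO to maximize the spread of the resulting grades; here they are simply fixed inputs.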
Abstract: Timestamps play a key role in process mining because they determine the chronology in which events occurred and, subsequently, how the events are ordered in process modelling. Timestamps give insight into process performance, conformance, and modelling, so problems with timestamps result in misrepresentations of the mined process. A few articles have been published on the quantification of data quality problems, but at the time of writing only one of them addresses the quantification of timestamp quality problems. This article evaluates the quality of timestamps in event logs across two axes, using eleven quality dimensions and four levels of potential data quality problems. The eleven data quality dimensions were obtained through a thorough literature review of more than fifty process mining articles that focus on quality dimensions. This evaluation resulted in twelve data quality quantification metrics, which were applied to the MIMIC-ll dataset as an illustration. The outcome of the timestamp quality quantification using the proposed typology enables the user to appreciate the quality of the event log and thus makes it possible to evaluate the risk of carrying out specific data cleaning measures to improve the process mining outcome.
Abstract: Real world study (RWS) has become a hotspot for clinical research. Data quality plays a vital role in research achievement and other clinical research fields. In this paper, the common quality problems in the RWS of traditional Chinese medicine are discussed, and a countermeasure is proposed.
Funding: the National Natural Science Foundation of China under Grant No. 40775021; the National "863" Project "Research on Application System of the Airborne Radar"; the China Meteorological Administration Project "Tropical West Pacific Ocean Observation and Predictability"; the National Key Basic Research and Development Program of China under Grant No. 2004CB418305.
Abstract: A C-band mobile polarimetric radar with simultaneous horizontal and vertical transmission was built at the State Key Laboratory of Severe Weather, Chinese Academy of Meteorological Sciences. It was used in heavy rainfall and typhoon observations in 2008. Radar calibration is well known to be essential to high-quality radar data and products. In this paper, test and weather signals were used to calibrate reflectivity ZH, differential reflectivity ZDR and differential phase ΦDP. Noise effects on the correlation coefficient ρHV at low signal-to-noise ratio (SNR) were analyzed. The polarimetric radar data for a heavy rain event and a snow event were inspected to evaluate the performance of the calibration method and the radar data quality, and S-band Doppler radar data were used to validate the quality of the reflectivity data collected by the polarimetric radar. The results show that the polarimetric and S-band Doppler radars observed comparable reflectivity values and a similar structure for a heavy rainfall case at middle and low levels. The mismatch between the two receivers produces obvious ZDR biases, which were verified by radar data observed at vertical incidence. The ZDR correction improved the radar data quality. The usable range for ρHV was defined. Applying the calibration method introduced in this paper can reduce the system biases caused by differences between the horizontal (H) and vertical (V) channels. After calibration and correction, the polarimetric parameters observed by the radar can be used in further research.
Funding: supported in part by the U.S. National Science Foundation (U.S. NSF) through the U.S. NSF/Department of Energy (DOE) Engineering Research Center Program under Award EEC-1041877 for CURENT.
Abstract: Synchrophasor systems, which provide low-latency, high-precision, and time-synchronized measurements to enhance power grid performance, are deployed globally. However, a synchrophasor system, as a physical network, involves communication constraints and data quality issues, which can impair or even disable certain synchrophasor applications. This work investigates the data quality issue for synchrophasor applications. In Part I, the standards of synchrophasor systems and the classifications and data quality requirements of synchrophasor applications are reviewed. Also, actual events concerning synchronization signal accuracy, synchrophasor data loss, and latency are counted and analyzed. The review and statistics are expected to provide an overall picture of data accuracy, loss, and latency issues for synchrophasor applications.
Funding: This research was funded by the National Key R&D Program of China (No. SQ2018YFB100002); the National Natural Science Foundation of China (Nos. 61761136020, 61672308); Microsoft Research Asia; Fraunhofer Cluster of Excellence on "Cognitive Internet Technologies"; the EU through project Track&Know (grant agreement 780754); NSFC (61761136020); NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609217); Zhejiang Provincial Natural Science Foundation (LR18F020001); NSFC Grants 61602306; Fundamental Research Funds for the Central Universities.
Abstract: Data quality management, especially data cleansing, has been extensively studied for many years in the areas of data management and visual analytics. In this paper, we first review and explore the relevant work from the research areas of data management, visual analytics and human-computer interaction. Then, for different types of data such as multimedia data, textual data, trajectory data, and graph data, we summarize the common methods for improving data quality by leveraging data cleansing techniques at different analysis stages. Based on a thorough analysis, we propose a general visual analytics framework for interactively cleansing data. Finally, the challenges and opportunities are analyzed and discussed in the context of data and humans.
Funding: supported by the National Natural Science Foundation of China (51734004 and 51704069).
Abstract: The real-time energy flow data obtained in industrial production processes are usually of low quality. It is difficult to accurately predict the short-term energy flow profile from these field data, which diminishes the effect of industrial big data and artificial intelligence in industrial energy systems. The real-time data on blast furnace gas (BFG) generation collected at iron and steel sites are also of low quality. To tackle this problem, a three-stage data quality improvement strategy was proposed for predicting BFG generation. In the first stage, the correlation principle was used to test the sample set. In the second stage, the original sample set was rectified and updated. In the third stage, a Kalman filter was employed to eliminate the noise in the updated sample set. The method was verified with an autoregressive integrated moving average model, a back propagation neural network model and a long short-term memory model. The results show that the prediction models based on the proposed three-stage data quality improvement method perform well. The long short-term memory model has the best prediction performance, with a mean absolute error of 17.85 m³/min, a mean absolute percentage error of 0.21%, and an R squared of 95.17%.
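The third stage can be illustrated with a minimal scalar Kalman filter. This is a sketch only: it assumes a random-walk state model, and the noise variances `q` and `r`, the initial uncertainty and the sample values are hypothetical rather than taken from the paper.

```python
def kalman_smooth(measurements, q=1e-3, r=1.0):
    """Scalar Kalman filter with a random-walk state model.

    q -- assumed process noise variance (how fast the true value may drift)
    r -- assumed measurement noise variance
    """
    if not measurements:
        return []
    x, p = measurements[0], 1.0   # initial estimate and its variance
    filtered = [x]
    for z in measurements[1:]:
        p = p + q                 # predict: uncertainty grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update the estimate with the innovation
        p = (1.0 - k) * p         # posterior variance
        filtered.append(x)
    return filtered

# Hypothetical gas flow readings with one noisy spike.
print(kalman_smooth([8500.0, 8510.0, 9100.0, 8495.0, 8505.0]))
```

A larger `r` relative to `q` makes the filter trust the measurements less and smooth more aggressively; in practice both would be tuned to the sensor and the process.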
Funding: supported by the University of Stavanger, Norway; SINTEF; the Center for Integrated Operations in the Petroleum Industry; and the management of National Oilwell Varco IntelliServ.
Abstract: Wired drill pipe (WDP) technology is one of the most promising data acquisition technologies in today's oil and gas industry. For the first time, it allows sensors to be positioned along the drill string, which enables valuable data to be collected and transmitted to the drill floor not only from the bottom hole assembly (BHA) but also along the entire length of the wellbore. The technology has received industry acceptance as a viable alternative to the typical logging while drilling (LWD) method. Recently, more and more WDP applications can be found in challenging drilling environments around the world, leading to many innovations in the industry. Nevertheless, most of the data acquired from WDP can be noisy and, in some circumstances, of very poor quality. Diverse factors contribute to the poor data quality. The most common sources include miscalibrated sensors, sensor drift, errors during data transmission, and abnormal conditions in the well. The challenge of improving the data quality has attracted increasing attention from researchers during the past decade. This paper proposes a promising solution to this challenge: correcting the raw WDP data and estimating unmeasurable parameters to reveal downhole behaviors. An advanced data processing method, data validation and reconciliation (DVR), is employed, which makes use of the redundant data from multiple WDP sensors to filter out the noise in the measurements and ensures the coherence of all sensors and models. Moreover, it can distinguish accurate measurements from inaccurate ones. In addition, the data with improved quality can be used to estimate crucial parameters of the drilling process that cannot be measured directly, and hence provide better model calibration for integrated well planning and real-time operations.
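The abstract does not describe the DVR model used. As a minimal sketch of the underlying idea, the simplest reconciliation case is several redundant sensors measuring the same quantity: weighted least squares under the constraint that all reconciled values are equal reduces to an inverse-variance weighted mean. The function name, readings and variances below are hypothetical.

```python
def reconcile_redundant(readings, variances):
    """Reconcile redundant measurements of one quantity.

    Minimizes sum((x - y_i)**2 / var_i) over x, which gives the
    inverse-variance weighted mean and its (reduced) variance.
    """
    if not readings or len(readings) != len(variances):
        raise ValueError("need one positive variance per reading")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, readings)) / total
    return estimate, 1.0 / total

# Two hypothetical pressure sensors; the second is four times noisier.
value, variance = reconcile_redundant([10.0, 20.0], [1.0, 4.0])
print(value, variance)
```

The reconciled variance is always smaller than that of the best single sensor, which is why redundancy along the drill string can reduce measurement noise. A full DVR system would add process-model constraints and gross-error detection on top of this.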
Funding: supported by the European Community's Seventh Framework Programme [FP7/2007–2013], [Grant No. 612096 (CAP4Access)].
Abstract: Several research projects have shown interest in employing volunteered geographic information (VGI) to improve their systems with up-to-date and detailed data. The European project CAP4Access is one successful example of such an international research project; it aims to improve accessibility for people with restricted mobility using crowdsourced data. In this project, OpenStreetMap (OSM) is used to extend OpenRouteService, a well-known routing platform. However, a basic challenge that the project tackled was the incompleteness of OSM data with regard to certain information required for wheelchair accessibility (e.g. sidewalk information, kerb data, etc.). In this article, we present the results of an initial assessment of sidewalk data in OSM at the beginning of the project, as well as our approach to raising awareness and using tools for tagging accessibility data in the OSM database to improve the completeness of the sidewalk data. Several experiments have been carried out in different European cities, and the results of the experiments and the lessons learned are discussed. The lessons learned provide recommendations that help in organizing better mapping party events in the future. We conclude by reporting on how and to what extent the completeness of OSM sidewalk data in these study areas had benefited from the mapping parties by the end of the project.
Abstract: Virtual globes (VGs) allow Internet users to view geographic data of heterogeneous quality created by other users. This article presents a new approach for collecting and visualizing information about the perceived quality of 3D data in VGs. It aims at improving users' awareness of the quality of 3D objects. Instead of relying on existing metadata or on formal accuracy assessments, which are often impossible in practice, we propose a crowd-sourced quality recommender system based on the five-star visualization method that has been successful in other types of Web applications. Four alternative five-star visualizations were implemented in a Google Earth-based prototype and tested through a formal user evaluation. These tests helped identify the most effective method for a 3D environment. Results indicate that while most websites use a visualization approach that shows a 'number of stars', this method was the least preferred by participants. Instead, participants ranked the 'number within a star' method highest, as it reduced visual clutter in urban settings, suggesting that 3D environments such as VGs require different design approaches than 2D or non-geographic applications. Results also confirmed that expert and non-expert users of geographic data share similar preferences for the most and least preferred visualization methods.