Funding: This research was funded by the National Key R&D Program of China (No. SQ2018YFB100002), the National Natural Science Foundation of China (Nos. 61761136020 and 61672308), Microsoft Research Asia, the Fraunhofer Cluster of Excellence on "Cognitive Internet Technologies", the EU through project Track&Know (grant agreement 780754), NSFC (61761136020), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609217), the Zhejiang Provincial Natural Science Foundation (LR18F020001), NSFC Grant 61602306, and the Fundamental Research Funds for the Central Universities.
Abstract: Data quality management, especially data cleansing, has been extensively studied for many years in the areas of data management and visual analytics. In this paper, we first review and explore the relevant work from the research areas of data management, visual analytics, and human-computer interaction. Then, for different types of data such as multimedia data, textual data, trajectory data, and graph data, we summarize the common methods for improving data quality by leveraging data cleansing techniques at different analysis stages. Based on a thorough analysis, we propose a general visual analytics framework for interactively cleansing data. Finally, the challenges and opportunities are analyzed and discussed in the context of data and humans.
Funding: Supported by the National Natural Science Foundation of China (Nos. U1509216 and 61472099), the National Key Technology Research and Development Program (No. 2015BAH10F01), the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (No. LC2016026), and the MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
Abstract: Data quality is an important aspect of data application and management, and currency is one of the major dimensions influencing that quality. In real applications, dataset timestamps are often incomplete, unavailable, or entirely absent. With the increasing requirement to update data in real time, existing methods can fail to adequately determine the currency of entities. In consideration of the velocity of big data, we propose a series of efficient algorithms for determining the currency of dynamic datasets, which we divide into two steps. In the preprocessing step, to better determine data currency and accelerate dataset updating, we propose the use of a topological graph of the processing order of the entity attributes. We then construct an Entity Query B-Tree (EQB-Tree) structure and an Entity Storage Dynamic Linked List (ES-DLL) to improve the querying and updating of both the data currency graph and the currency scores. In the currency determination step, we propose definitions of the currency score and currency information for tuples referring to the same entity, and use examples to discuss methods and algorithms for their computation. Based on our experimental results with both real and synthetic data, we verify that our methods can efficiently update data in the correct order of currency.
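The preprocessing step above orders entity attributes by a topological graph so that an attribute is processed only after the attributes it depends on. The sketch below illustrates that idea with Kahn's algorithm; the attribute names, dependency edges, and the function `topological_order` are illustrative assumptions, not the paper's actual structures (the EQB-Tree and ES-DLL are not reproduced here):

```python
from collections import defaultdict, deque

def topological_order(attributes, edges):
    """Kahn's algorithm: return an order in which each attribute is
    processed only after every attribute it depends on.
    `edges` contains (src, dst) pairs meaning src precedes dst."""
    indegree = {a: 0 for a in attributes}
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
        indegree[dst] += 1
    queue = deque(a for a in attributes if indegree[a] == 0)
    order = []
    while queue:
        a = queue.popleft()
        order.append(a)
        for nxt in adj[a]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(attributes):
        raise ValueError("dependency graph contains a cycle")
    return order

# Hypothetical example: a change in "name" implies later changes in
# "address" and "salary"; a change in "address" implies one in "zip".
attrs = ["name", "address", "zip", "salary"]
deps = [("name", "address"), ("address", "zip"), ("name", "salary")]
print(topological_order(attrs, deps))  # → ['name', 'address', 'salary', 'zip']
```

Processing tuples in such an order lets currency scores for dependent attributes be computed after the attributes they derive from, which is what makes incremental updates to the currency graph feasible.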
Funding: This research was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N150408001-3.
Abstract: Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies, but they may miss potential inconsistencies because they do not consider the content relationships of data. Content-related conditional functional dependencies (CCFDs) are a special type of CFDs that combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing are interactive: 1) detection catches inconsistencies, and 2) repairing corrects the caught inconsistencies while possibly introducing new ones. Besides, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies incurs expensive data shipment. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method, which work iteratively. The detection method marks the violated CCFDs to identify the inconsistencies that should be repaired preferentially. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method heuristically repairs the inconsistencies with minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rule sequences. Distinct values require less data shipment for communication than real data. Rule sequences determine appropriate repairing orders to avoid some incorrect repairs. Our solution is shown to be more effective than CFDs by empirical evaluation on two real-life datasets.
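The detection step can be illustrated with a minimal sketch of CFD-style violation checking: among tuples that match a pattern's constant values, any two that agree on the left-hand-side attributes must also agree on the right-hand side. The relation, attribute names, and helper function below are illustrative assumptions, not the paper's distributed CCFD implementation:

```python
def cfd_violations(tuples, lhs, rhs, pattern):
    """Return pairs of tuple indices that violate the CFD
    (lhs -> rhs) conditioned on the constants in `pattern`."""
    def matches(t):
        # the CFD only constrains tuples matching the pattern's constants
        return all(t.get(a) == v for a, v in pattern.items())

    groups = {}       # lhs value combination -> list of (index, tuple)
    violations = []
    for i, t in enumerate(tuples):
        if not matches(t):
            continue
        key = tuple(t[a] for a in lhs)
        for j, other in groups.get(key, []):
            if any(t[a] != other[a] for a in rhs):
                violations.append((j, i))
        groups.setdefault(key, []).append((i, t))
    return violations

# Hypothetical relation: country code, zip code, street.
rows = [
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Mayfield"},
    {"CC": "44", "ZIP": "EH4 1DT", "STR": "Crichton"},  # conflicts with row 0
    {"CC": "01", "ZIP": "EH4 1DT", "STR": "Anything"},  # pattern CC=44 not matched
]
# CFD: ([CC, ZIP] -> [STR]) with pattern CC = 44
print(cfd_violations(rows, ["CC", "ZIP"], ["STR"], {"CC": "44"}))  # → [(0, 1)]
```

A CCFD detector would additionally pull content-related tuples together before grouping, and a distributed setting would ship compact summaries (such as the distinct values mentioned above) rather than the tuples themselves.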