Big data are often processed repeatedly with small changes, which is a major form of big data processing. This incremental nature of big data suggests that an incremental computing model can improve performance greatly. HDFS is the distributed file system of Hadoop, the most popular platform for big data analytics, but HDFS adopts a fixed-size chunking policy, which is inefficient for incremental computing. Therefore, in this paper we propose iHDFS (incremental HDFS), a distributed file system that provides a basic guarantee for big data parallel processing. iHDFS is implemented as an extension to HDFS and applies the Rabin fingerprint algorithm to achieve content-defined chunking. This policy makes data chunking far more stable, so intermediate processing results can be reused efficiently and the performance of incremental data processing is improved significantly. Experimental results demonstrate the effectiveness and efficiency of iHDFS.
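To make the chunking idea concrete, here is a minimal Python sketch of content-defined chunking with a rolling hash. It uses a simple polynomial rolling hash as a stand-in for the Rabin fingerprint used in iHDFS, and the window size, boundary mask, and chunk-size limits are illustrative assumptions rather than the paper's settings; the point is that chunk boundaries depend on content, so a small edit only disturbs nearby chunks.

```python
import os

WINDOW = 48             # rolling-window size in bytes (assumed)
MASK = (1 << 13) - 1    # boundary when (hash & MASK) == MASK, ~8 KB average chunks (assumed)
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return (start, end) offsets of content-defined chunks of `data`."""
    boundaries, start, h = [], 0, 0
    pow_w = pow(BASE, WINDOW - 1, MOD)    # weight of the byte sliding out of the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                       # add the new byte
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            boundaries.append((start, i + 1))          # content (or the size cap) marks a cut
            start, h = i + 1, 0
    if start < len(data):
        boundaries.append((start, len(data)))
    return boundaries

# A small local edit leaves most chunk contents unchanged, so their
# intermediate processing results could be reused.
original = os.urandom(100_000)
modified = original[:500] + b"edit" + original[500:]
orig_chunks = {original[s:e] for s, e in chunk_boundaries(original)}
mod_chunks = [modified[s:e] for s, e in chunk_boundaries(modified)]
print(sum(c in orig_chunks for c in mod_chunks), "of", len(mod_chunks), "chunks unchanged")
```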
Entity resolution is a key aspect of data quality and data integration: identifying which records in data sources correspond to the same real-world entity. Many existing approaches require manually designed match rules, which demands domain knowledge and is time consuming. We propose a novel genetic-algorithm-based entity resolution approach via active learning. It learns effective match rules by logically combining comparisons of several attributes with proper thresholds, and it uses active learning to reduce the amount of manually labeled data and speed up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
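As a rough illustration of what such a learned match rule looks like, the sketch below encodes one candidate rule as a logical combination of per-attribute similarity comparisons with thresholds, together with the accuracy-style fitness a genetic algorithm could maximize over labeled pairs. The attributes, similarity function, and thresholds are made-up assumptions, not rules learned by the paper's approach.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """A cheap string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_rule(r1: dict, r2: dict) -> bool:
    """One candidate rule: (name AND city similar) OR phone nearly identical."""
    return ((sim(r1["name"], r2["name"]) >= 0.85 and
             sim(r1["city"], r2["city"]) >= 0.70) or
            sim(r1["phone"], r2["phone"]) >= 0.95)

def fitness(rule, labeled_pairs):
    """Accuracy of a rule on (record1, record2, is_match) examples;
    this is the objective a genetic algorithm would maximize."""
    correct = sum(rule(a, b) == y for a, b, y in labeled_pairs)
    return correct / len(labeled_pairs)

labeled = [
    ({"name": "Acme Corp", "city": "Boston", "phone": "617-555-0101"},
     {"name": "ACME Corporation", "city": "Boston", "phone": "617-555-0101"}, True),
    ({"name": "Acme Corp", "city": "Boston", "phone": "617-555-0101"},
     {"name": "Apex Ltd", "city": "Denver", "phone": "303-555-0199"}, False),
]
print(fitness(match_rule, labeled))
```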
Conditional functional dependencies (CFDs) are an important technique for data consistency. However, CFDs are limited in their ability to 1) provide reasonable values for consistency repairing and 2) detect potential errors. This paper presents context-aware conditional functional dependencies (CCFDs), which help provide reasonable values and detect potential errors; in particular, we focus on automatically discovering minimal CCFDs. We introduce context relativity to measure the relationship between CFDs. The overlap of related CFDs can provide reasonable values, which leads to more accurate consistency repairing, and related CFDs are combined into CCFDs. Moreover, we prove that discovering minimal CCFDs is NP-complete and design both a precise method and a heuristic method, along with a dominating-value technique that facilitates both. Since the context relativity of CFDs affects the cleaning results, we also suggest an approximate threshold of context relativity according to the data distribution. Our empirical evaluation confirms that the repairing results are more accurate.
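For readers unfamiliar with CFDs, the following sketch shows the kind of constraint involved and how violations are detected: a CFD states that, for tuples matching a pattern (here, country = 'UK'), agreement on the left-hand-side attributes must imply agreement on the right-hand side. The relation and pattern are toy assumptions; CCFD discovery and context relativity are not reproduced here.

```python
def cfd_violations(rows, lhs, rhs, pattern):
    """Return pairs of rows that satisfy the pattern and agree on lhs but disagree on rhs."""
    matching = [r for r in rows if all(r[a] == v for a, v in pattern.items())]
    violations = []
    for i in range(len(matching)):
        for j in range(i + 1, len(matching)):
            r1, r2 = matching[i], matching[j]
            if all(r1[a] == r2[a] for a in lhs) and r1[rhs] != r2[rhs]:
                violations.append((r1, r2))
    return violations

# CFD: ([country, zip] -> city) with pattern country = 'UK':
# within the UK, equal zip codes must imply equal cities.
rows = [
    {"country": "UK", "zip": "EH4 8LE", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4 8LE", "city": "London"},     # inconsistent with the row above
    {"country": "US", "zip": "10001",   "city": "New York"},
]
print(cfd_violations(rows, lhs=["country", "zip"], rhs="city", pattern={"country": "UK"}))
```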
1 Introduction Biomedical entity linking aims to align a natural language entity to the knowledge base concept that refers to the same real-world object. Recent solutions in the biomedical field focus on embedding-based methods that jointly model the texts and the knowledge graph in a multi-dimensional entity space. However, biomedical entity linking still faces tough challenges: 1) the ambiguity of natural language descriptions, including polysemy and abbreviation.
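A minimal sketch of the embedding-based linking step is shown below: a mention embedding is scored against knowledge-base concept embeddings by cosine similarity and linked to the best-scoring concept. The vectors and concept names are toy assumptions; the joint text/knowledge-graph embedding model itself is not reproduced.

```python
import numpy as np

def link(mention_vec, concept_vecs):
    """Link a mention embedding to the KB concept with the highest cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(concept_vecs, key=lambda name: cos(mention_vec, concept_vecs[name]))

concepts = {
    "Myocardial infarction": np.array([0.9, 0.1, 0.0]),
    "Migraine":              np.array([0.1, 0.8, 0.2]),
}
mention = np.array([0.85, 0.15, 0.05])   # hypothetical embedding of the mention "heart attack"
print(link(mention, concepts))           # expected to pick "Myocardial infarction"
```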
Information networks provide a powerful representation of entities and the relationships between them. Information network fusion is a form of information fusion that jointly reasons about entities, links, and relations across multiple sources. However, existing methods for information network fusion tend to rely on a single task, which may not gather enough evidence for reasoning. To address this issue, we present a novel model called MC-INFM (information networks fusion model based on multi-task coordination). Unlike traditional models, MC-INFM casts the fusion problem as a probabilistic inference problem and collectively performs multiple tasks (including entity resolution, link prediction, and relation matching) to infer the final fusion result. First, we define intra-features and inter-features and model them as factor graphs, which provide abundant evidence for inference. Then we use a conditional random field (CRF) to learn the weight of each feature and infer the results of these tasks simultaneously by performing maximum-probability inference. Experiments demonstrate the effectiveness of our proposed model.
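To illustrate what collective inference over several fusion tasks means, the sketch below scores three binary decisions (entity match, link existence, relation match) with intra-task factors plus one inter-task factor that rewards consistent joint assignments, and brute-forces the maximum-scoring (MAP) assignment. The factor weights and the coupling are made-up assumptions; MC-INFM learns its weights with a CRF rather than fixing them by hand.

```python
from itertools import product
import math

# Intra-task evidence: log-factor gained when each task's decision is "true".
intra = {"entity_match": 1.2, "link_exists": 0.4, "relation_match": -0.3}

def inter(assign: dict) -> float:
    """Inter-task factor: a predicted link is more plausible when its endpoints
    are resolved to the same entity (an illustrative coupling)."""
    return 0.8 if assign["entity_match"] and assign["link_exists"] else 0.0

def map_inference():
    """Enumerate all joint assignments and return the highest-scoring one."""
    best, best_score = None, -math.inf
    for values in product([False, True], repeat=len(intra)):
        assign = dict(zip(intra, values))
        score = sum(w for task, w in intra.items() if assign[task]) + inter(assign)
        if score > best_score:
            best, best_score = assign, score
    return best, best_score

print(map_inference())
```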
1 Introduction Record linkage (RL) groups records corresponding to the same entities in datasets, and is a long-standing topic in the data management and mining communities [1-2]. In the big data era, real-time data applications have become popular and call for pay-as-you-go RL (PRL), which produces as many match pairs as possible in very limited time (much shorter than the overall RL runtime).
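Here is a minimal sketch of the pay-as-you-go idea, under an assumed similarity measure, threshold, and time budget: candidate pairs are ordered by a cheap proxy so the most promising comparisons run first, and matching stops when the budget expires, returning whatever pairs have been confirmed so far.

```python
import time
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def pay_as_you_go_rl(records, budget_seconds=0.05, threshold=0.9):
    """Emit as many match pairs as possible before the time budget runs out."""
    pairs = [(i, j) for i in range(len(records)) for j in range(i + 1, len(records))]
    # Cheap ordering proxy: pairs sharing their first token are verified first;
    # a real PRL system would use a learned or blocking-based ranking.
    pairs.sort(key=lambda p: records[p[0]].split()[0] != records[p[1]].split()[0])
    deadline = time.monotonic() + budget_seconds
    matches = []
    for i, j in pairs:
        if time.monotonic() > deadline:
            break                      # budget exhausted: return what we have so far
        if sim(records[i], records[j]) >= threshold:
            matches.append((records[i], records[j]))
    return matches

recs = ["john smith 1980", "john smith 1980-01", "jane doe 1975", "j. smith 1980"]
print(pay_as_you_go_rl(recs))
```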
Biomedical entity alignment, composed of two subtasks, entity identification and entity-concept mapping, is of great research value in biomedical text mining; these techniques are widely used for named entity standardization, information retrieval, knowledge acquisition, and ontology construction. Previous works devoted much effort to feature engineering for feature-based entity identification and alignment models. However, models that depend on subjective feature selection may suffer from error propagation and cannot utilize hidden information. With the rapid development of health-related research, researchers need an effective method to explore the large amount of available biomedical literature. Therefore, we propose a two-stage entity alignment process, the biomedical entity exploring model, which identifies biomedical entities and aligns them to the knowledge base interactively. The model automatically obtains semantic information for extracting biomedical entities and mines semantic relations through the standard biomedical knowledge base. Experiments show that the proposed method achieves better performance on entity alignment, improving the F1 scores of the task by about 4.5% in entity identification and 2.5% in entity-concept mapping.
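As a rough illustration of the two-stage structure (not the paper's model), the sketch below identifies candidate mentions by surface-form lookup against a tiny knowledge base and then maps each mention to its best-matching concept by token overlap; the dictionary, synonyms, and scoring are illustrative assumptions.

```python
KB = {"myocardial infarction": ["heart attack", "mi"],
      "hypertension": ["high blood pressure"]}

def identify(text):
    """Stage 1: find entity mentions by surface-form lookup (a real system would use a tagger)."""
    text_l = text.lower()
    return [surface
            for concept, synonyms in KB.items()
            for surface in [concept] + synonyms
            if surface in text_l]

def map_to_concept(mention):
    """Stage 2: align a mention to the KB concept whose surface forms overlap it most."""
    def overlap(concept):
        tokens = set(mention.split())
        forms = [concept] + KB[concept]
        return max(len(tokens & set(f.split())) / len(tokens | set(f.split())) for f in forms)
    return max(KB, key=overlap)

text = "The patient had a heart attack and a history of high blood pressure."
for mention in identify(text):
    print(mention, "->", map_to_concept(mention))
```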
1 Introduction and main contributions Private entity matching (PEM) [1] is the problem of finding records from two or more data sources that refer to the same or similar individuals, without revealing any information beyond the matched records. There has been a substantial amount of work on PEM.
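One simple way to convey the PEM setting is the keyed-hash sketch below: both parties encode normalized identifiers with a shared secret key and compare only the encodings, so non-matching records reveal nothing beyond an opaque hash. This is a simplified stand-in under assumed normalization and key agreement; it is not the protocol studied in the paper and it does not support approximate ("similar") matches.

```python
import hmac
import hashlib

SHARED_KEY = b"agreed-upon-secret"   # assumed to be established out of band by the parties

def encode(identifier: str) -> str:
    """Normalize an identifier and encode it with a keyed hash (HMAC-SHA256)."""
    normalized = identifier.strip().lower()
    return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

party_a = {"alice@example.com", "bob@example.com"}
party_b = {"Bob@Example.com", "carol@example.com"}

hashes_a = {encode(x): x for x in party_a}
hashes_b = {encode(x) for x in party_b}

matches = [hashes_a[h] for h in hashes_a if h in hashes_b]
print(matches)   # only the overlapping record is learned
```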
Many scholars have focused on developing effective techniques for package queries, and many excellent approaches have been proposed. Unfortunately, most existing methods target small volumes of data, and the rapid increase in data volume means traditional package query methods struggle to meet growing requirements. To solve this problem, this paper proposes a novel optimization method for package queries, HPPQ. First, the data is preprocessed into regions: preprocessing segments the dataset into multiple subsets, and the centroids of the subsets are used for package queries, which effectively reduces the volume of candidate results. Furthermore, an efficient heuristic algorithm, IPOL-HS, is proposed based on the preprocessing results; it improves the quality of candidate results in the iterative stage and the convergence rate of the heuristic search. Finally, a strategy called HPR is proposed, which relies on a greedy algorithm and parallel processing to accelerate query processing. Experimental results show that our method significantly reduces time consumption compared with existing methods.
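The preprocessing idea can be sketched as follows: tuples are grouped into regions, each region is represented by its centroid, and a (here greedy) package search runs over the much smaller centroid set. The grouping rule, constraint, and objective are illustrative assumptions and do not correspond to IPOL-HS or HPR.

```python
from statistics import mean

def regions(items, key=lambda x: round(x["cost"], -1)):
    """Segment items into regions (here: by rounded cost) and return region centroids."""
    groups = {}
    for it in items:
        groups.setdefault(key(it), []).append(it)
    return [{"cost": mean(i["cost"] for i in g), "value": mean(i["value"] for i in g)}
            for g in groups.values()]

def greedy_package(candidates, budget):
    """Greedy package query: maximize total value subject to a total-cost budget."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: c["value"] / c["cost"], reverse=True):
        if spent + c["cost"] <= budget:
            chosen.append(c)
            spent += c["cost"]
    return chosen, spent

items = [{"cost": 12, "value": 8}, {"cost": 14, "value": 9},
         {"cost": 31, "value": 15}, {"cost": 29, "value": 14},
         {"cost": 52, "value": 20}]
centroids = regions(items)          # candidate set shrinks from 5 tuples to 3 regions
print(greedy_package(centroids, budget=45))
```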