Funding: Supported by the National Natural Science Foundation of China (No. 61003018), the Natural Science Foundation of Jiangsu Province, China (No. BK2011189), and the National Social Science Foundation of China (No. 11AZD121).
Abstract: Many ontologies have been published on the Semantic Web to be shared for describing resources. Among them, large ontologies of real-world domains pose a scalability problem for semantic technologies such as ontology matching (OM), which either suffers from an excessively long run time or imposes strong assumptions on the running environment. To address this issue, we propose V-Doc+, a three-stage approach for matching large ontologies based on the MapReduce framework and the virtual document technique. Specifically, in the first stage, two MapReduce processes extract the textual descriptions of named entities (classes, properties, and instances) and blank nodes, respectively. In the second stage, the extracted descriptions are exchanged with neighbors in Resource Description Framework (RDF) graphs to construct virtual documents; this process also benefits from the MapReduce-based implementation. In the third stage, a word-weight-based partitioning method is proposed to conduct parallel similarity calculation using the term frequency-inverse document frequency (TF-IDF) model. Experimental results on two large-scale real datasets and the benchmark testbed from the Ontology Alignment Evaluation Initiative (OAEI) show that the proposed approach significantly reduces the run time with only a minor loss in precision and recall.
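The abstract's third stage scores entity pairs by comparing virtual documents under the TF-IDF model. The paper's distributed, partitioned implementation is not reproduced here; the following is only a minimal single-machine sketch of TF-IDF weighting and cosine similarity over tokenized virtual documents, with all function names being illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF weight vector (dict) for each token list in docs."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF = relative term frequency; IDF = log(n / df); terms in every doc get weight 0
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two virtual documents that share no weighted terms score 0, and a document compared with itself scores 1, so the measure can rank candidate matches between entities of two ontologies.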
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61872172 and 61772264.
Abstract: Entity resolution (ER) aims to identify whether two entities in an ER task refer to the same real-world thing. Crowd ER uses humans, in addition to machine algorithms, to obtain the truths of ER tasks. However, inaccurate or erroneous results are likely to be generated when humans give unreliable judgments. Previous studies have found that correctly estimating human accuracy or expertise in crowd ER is crucial to truth inference. However, many of them assume that humans have consistent expertise over all tasks and ignore the fact that humans may have varied expertise on different topics (e.g., music versus sport). In this paper, we deal with crowd ER in the Semantic Web area. We identify multiple topics of ER tasks and model human expertise on different topics. Furthermore, we leverage similar-task clustering to enhance the topic modeling and expertise estimation. We propose a probabilistic graphical model that computes ER task similarity, estimates human expertise, and infers the task truths in a unified framework. Evaluation results on real-world and synthetic datasets show that, compared with several state-of-the-art approaches, our model achieves higher accuracy in task truth inference and is more consistent with humans' real expertise.
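The paper's unified probabilistic graphical model is not reproduced here; the sketch below only illustrates the underlying intuition of topic-aware truth inference, i.e., weighting each worker's vote by an estimated per-topic accuracy and alternating between truth estimation and accuracy re-estimation, EM-style. All data shapes and names are hypothetical.

```python
from collections import defaultdict

def infer_truths(labels, topics, iters=10):
    """
    labels: {(worker, task): 0/1 judgment on whether two entities match}
    topics: {task: topic label, e.g., "music" or "sport"}
    Alternates between (1) inferring each task's truth by a vote weighted
    with per-topic worker accuracy and (2) re-estimating those accuracies.
    """
    tasks = {t for _, t in labels}
    # Start every (worker, topic) pair at a mildly-better-than-chance accuracy
    acc = defaultdict(lambda: 0.6)
    truths = {}
    for _ in range(iters):
        # Truth step: accuracy-weighted vote per task
        for t in tasks:
            score = 0.0
            for (w, t2), y in labels.items():
                if t2 == t:
                    weight = acc[(w, topics[t])]
                    score += weight if y == 1 else -weight
            truths[t] = 1 if score > 0 else 0
        # Accuracy step: agreement rate per (worker, topic), Laplace-smoothed
        counts = defaultdict(lambda: [0, 0])  # (worker, topic) -> [correct, total]
        for (w, t), y in labels.items():
            c = counts[(w, topics[t])]
            c[0] += int(y == truths[t])
            c[1] += 1
        for key, (correct, total) in counts.items():
            acc[key] = (correct + 1) / (total + 2)
    return truths, dict(acc)
```

A worker who disagrees with the inferred truths on a topic ends up with a lower accuracy on that topic, so their votes count less there while remaining influential on topics where they are reliable.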