Funding: Supported by the National Natural Science Foundation of China (60673139, 60473073, 60573090)
Abstract: Deep Web sources contain a large amount of high-quality, query-related structured data. One of the challenges in the Deep Web is extracting the result schemas of Deep Web sources. To address this challenge, this paper describes a novel approach that extracts both the result data and the result schema of a Web database. The approach first models the query interface of a Deep Web source and fills it in with a specific query instance. The result pages of the Deep Web source are then represented as a tree structure in order to retrieve the subtrees that contain elements of the query instance. Next, the result schema of the Deep Web source is extracted by matching the subtrees' nodes with the query instance; a two-phase schema extraction method is adopted to obtain a more accurate result schema. Finally, experiments on real Deep Web sources show the utility of our approach, which achieves high precision and recall.
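The subtree retrieval and node matching steps described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the sample page, the `query` dictionary, and the helper names `find_record_subtrees` and `match_schema` are all hypothetical, the page is assumed to be well-formed markup, and the two-phase refinement mentioned in the abstract is not modeled because the abstract does not specify it.

```python
import xml.etree.ElementTree as ET

def subtree_text(node):
    """Concatenate all non-empty text fragments inside a node."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())

def find_record_subtrees(root, query_instance):
    """Return the deepest subtrees whose text contains every query value."""
    values = [v.lower() for v in query_instance.values()]
    hits = []

    def visit(node):
        text = subtree_text(node).lower()
        if not all(v in text for v in values):
            return False
        # Descend first: a child that still contains every value is a
        # tighter candidate than its parent.
        child_hit = False
        for child in node:
            if visit(child):
                child_hit = True
        if not child_hit:
            hits.append(node)
        return True

    visit(root)
    return hits

def match_schema(subtree, query_instance):
    """Label each leaf position with the query attribute whose value it holds."""
    schema = {}
    leaves = [n for n in subtree.iter() if len(n) == 0 and (n.text or "").strip()]
    for index, leaf in enumerate(leaves):
        for attr, value in query_instance.items():
            if value.lower() in leaf.text.lower():
                schema[index] = attr
    return schema

# Hypothetical result page and query instance (assumptions for illustration).
page = """<html><body><table>
<tr><td>Database Systems</td><td>Ullman</td><td>$80</td></tr>
<tr><td>Cooking Basics</td><td>Smith</td><td>$20</td></tr>
</table></body></html>"""
query = {"title": "Database Systems", "author": "Ullman"}

root = ET.fromstring(page)
records = find_record_subtrees(root, query)
schema = match_schema(records[0], query)
print(records[0].tag, schema)
```

In this sketch the third column (the price) stays unlabeled because the query instance carries no value for it; labeling such columns would need additional evidence beyond the query instance.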
Funding: This work is supported by the NKBRSF under grant No. G1999032705.
Abstract: Many modern applications (e-commerce, digital libraries, etc.) require integrated access to various information sources (from traditional RDBMSs to semistructured Web repositories). Extracting a schema from semistructured data is a prerequisite for integrating heterogeneous information sources. The traditional method, which extracts a global schema, may require time (and space) that grows exponentially with the number of objects and edges in the source. This paper presents a new method that extracts a local schema instead. The algorithm keeps the scale of the extracted schema within the 'schema diameter' by examining the semantic distance of the target set and using the Hash class and its path distance operation. This method is very efficient at restraining the schema from expanding. A prototype validates the new approach.
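The idea of bounding schema extraction by a diameter can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not define the Hash class, the semantic distance, or the path distance operation, so the sketch substitutes a plain breadth-first traversal that stops after `diameter` edges. The function name `local_schema`, the edge-triple format, and the sample graph are hypothetical.

```python
from collections import deque

def local_schema(edges, targets, diameter):
    """Collect the label paths reachable from the targets within `diameter` edges.

    `edges` is a list of (source, label, destination) triples describing a
    semistructured data graph; each returned path is a tuple of edge labels.
    """
    adjacency = {}
    for source, label, dest in edges:
        adjacency.setdefault(source, []).append((label, dest))

    schema = set()
    queue = deque((t, ()) for t in targets)
    seen = set(queue)
    while queue:
        node, path = queue.popleft()
        if len(path) == diameter:
            continue  # the diameter bound stops the schema from expanding
        for label, dest in adjacency.get(node, []):
            new_path = path + (label,)
            schema.add(new_path)
            state = (dest, new_path)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return schema

# Hypothetical data graph (an assumption for illustration).
edges = [
    ("db", "book", "b1"),
    ("b1", "title", "t1"),
    ("b1", "author", "a1"),
    ("a1", "name", "n1"),
    ("n1", "first", "f1"),
]
paths = local_schema(edges, ["db"], 2)
print(sorted(paths))
```

Because every path is cut off at `diameter` labels, the traversal terminates even on cyclic graphs, and the size of the result depends on the neighborhood of the target set and not on the whole source.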