Abstract: Semistructured data are specified without any fixed and rigid schema, even though some implicit structure typically appears in the data. The huge number of on-line applications makes it imperative to mine the schema of semistructured data, both for users (e.g., to gather useful information and facilitate querying) and for systems (e.g., to optimize access). The critical problem is to discover the hidden structure in the semistructured data. Current methods for extracting Web data structure are either general and independent of application background, or bound to a concrete environment such as HTML or XML. Both, however, are costly and have difficulty keeping up with the frequent and complicated variations of Web data. In this paper, the problem of incremental mining of schema for semistructured data after the update of the raw data is discussed. An algorithm for incrementally mining the schema of semistructured data is provided, and some experimental results are also given, which show that incremental mining for semistructured data is more efficient than non-incremental mining.
Funding: This work is supported by the NKBRSF under grant No. G1999032705.
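The abstract on incremental schema mining does not spell out its algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: records are modeled as nested Python dicts and lists, the "schema" is approximated as the set of label paths whose support exceeds a threshold, and the names `label_paths` and `IncrementalSchema` are hypothetical. It shows why an incremental update can be cheaper than re-mining: only the added or removed record is traversed, while the stored support counts are reused.

```python
from collections import Counter


def label_paths(obj, prefix=()):
    """Yield every label path (a tuple of keys) found in a nested dict/list."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = prefix + (key,)
            yield path
            yield from label_paths(value, path)
    elif isinstance(obj, list):
        # List items share the label path of the list itself.
        for item in obj:
            yield from label_paths(item, prefix)


class IncrementalSchema:
    """Hypothetical helper: keeps per-path support counts across updates."""

    def __init__(self):
        self.support = Counter()  # label path -> number of records containing it
        self.n_objects = 0

    def add(self, obj):
        # Only the new record is scanned; existing counts are reused.
        self.support.update(set(label_paths(obj)))
        self.n_objects += 1

    def remove(self, obj):
        self.support.subtract(set(label_paths(obj)))
        self.n_objects -= 1
        # Unary plus drops paths whose count fell to zero or below.
        self.support = +self.support

    def schema(self, min_support=0.5):
        """Return label paths present in at least min_support of the records."""
        threshold = min_support * self.n_objects
        return {path for path, count in self.support.items() if count >= threshold}
```

A non-incremental baseline would rebuild `support` from every record after each change; here `add` and `remove` cost time proportional to the size of the changed record only.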
Abstract: Many modern applications (e-commerce, digital libraries, etc.) require integrated access to various information sources (from traditional RDBMSs to semistructured Web repositories). Extracting schema from semistructured data is a prerequisite for integrating heterogeneous information sources. The traditional method, which extracts a global schema, may require time (and space) that increases exponentially with the number of objects and edges in the source. This paper presents a new method that extracts a local schema instead. The algorithm keeps the extracted schema within the 'schema diameter' by examining the semantic distance of the target set and using the Hash class and its path distance operation. This method is very efficient at restraining the schema from expanding. The prototype validates the new approach.
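The local-schema idea can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not define its semantic distance or its Hash class, so this example assumes that distance is simply the number of edges from a target object and uses a plain Python dict as the edge table. The function name `local_schema` and the graph layout are hypothetical. The point it shows is that bounding traversal by a diameter keeps the extracted schema from growing with the whole source.

```python
from collections import deque


def local_schema(edges, targets, diameter):
    """Collect label paths reachable from `targets` within `diameter` edges.

    edges: dict mapping a node id to a list of (label, child_id) pairs.
    Assumption: path length in edges stands in for semantic distance.
    """
    schema = set()
    queue = deque((node, ()) for node in targets)
    visited = set(queue)  # (node, label path) states already queued
    while queue:
        node, path = queue.popleft()
        if len(path) == diameter:
            continue  # do not expand beyond the schema diameter
        for label, child in edges.get(node, []):
            new_path = path + (label,)
            schema.add(new_path)
            state = (child, new_path)
            if state not in visited:
                visited.add(state)
                queue.append(state)
    return schema
```

For example, with `edges = {"lib": [("book", "b1")], "b1": [("title", "t1")]}`, calling `local_schema(edges, ["lib"], 1)` returns only the path `("book",)`, while a diameter of 2 also includes `("book", "title")`. Because every path is capped at `diameter` labels, the traversal terminates even on cyclic graphs.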