This paper examines the automatic recognition and extraction of tables from a large collection of heterogeneous documents. The documents are first pre-processed and converted to HTML, after which an algorithm recognises the table portions of each document. A Hidden Markov Model (HMM) is then applied to the HTML code to extract the tables. The model was trained and tested on 526 self-generated tables (321 for training and 205 for testing), and the Viterbi algorithm was used in the testing stage. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is effective for table extraction.
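As an illustration of the decoding step mentioned in this abstract, the following is a minimal sketch of Viterbi decoding over a tokenised HTML stream. The two states (`other`, `table`), the token alphabet (`text`, `tr`, `td`) and all probabilities are hypothetical values chosen for the example; the abstract does not specify the model's states or parameters.

```python
import math


def _log(x):
    # Log of a probability; zero probability maps to negative infinity.
    return math.log(x) if x > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations (log-space)."""
    # best[t][s]: best log-probability of any path ending in state s at time t.
    best = [{s: _log(start_p[s]) + _log(emit_p[s][observations[0]]) for s in states}]
    # back[t][s]: predecessor state on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + _log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + _log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: is a token inside a table or not?
states = ["other", "table"]
start_p = {"other": 0.8, "table": 0.2}
trans_p = {
    "other": {"other": 0.7, "table": 0.3},
    "table": {"other": 0.2, "table": 0.8},
}
emit_p = {
    "other": {"text": 0.9, "tr": 0.05, "td": 0.05},
    "table": {"text": 0.2, "tr": 0.4, "td": 0.4},
}
tokens = ["text", "tr", "td", "td", "text"]

if __name__ == "__main__":
    print(viterbi(tokens, states, start_p, trans_p, emit_p))
```

With these assumed parameters the run of `tr`/`td` tokens is labelled `table` and the surrounding `text` tokens are labelled `other`. In a trained system the probabilities would be estimated from the labelled training tables.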
Existing data mining methods focus mostly on relational databases and structured data, not on complex structured data such as Extensible Markup Language (XML) documents. By converting the XML document type description into a relational semantic model that records the relations in the XML data, and by using an XML data mining language, the XML data mining system presents a strategy for mining information from XML.
The Extensible Markup Language (XML) is becoming a de facto standard for exchanging information among web applications. Efficient implementation of a web application requires efficient implementation of its XML and XML schema documents. The quality of an XML document has a great impact on the design quality of its schema document. Therefore, the design of XML schema documents plays an important role in the web engineering process and must satisfy many schema qualities, including functionality, extensibility, reusability, understandability and maintainability. Three schema metrics are proposed: the Reusable Quality metric (RQ), the Extensible Quality metric (EQ) and the Understandable Quality metric (UQ), which measure the reusability, extensibility and understandability of XML schema documents in the web engineering process, respectively. The base attributes are selected according to the XML Quality Assurance Design Guidelines. The metrics are formulated using the binary entropy function and the Rank Order Centroid method. The self-organizing feature map (SOM) and Weyuker's 9 properties are used to check the validity of the proposed metrics empirically and analytically.
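The two building blocks named in this abstract have standard textbook definitions, sketched below. This is only an illustration of the binary entropy function and Rank Order Centroid weights; how the paper combines them into RQ, EQ and UQ is not stated in the abstract and is not reproduced here. The function names are chosen for this example.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        # By convention 0*log2(0) is taken as 0.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def rank_order_centroid(n):
    """Rank Order Centroid weights for n ranked attributes.

    The attribute ranked i-th (1 = most important) receives
    w_i = (1/n) * sum_{k=i}^{n} 1/k, and the weights sum to 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]


if __name__ == "__main__":
    print(binary_entropy(0.5))
    print(rank_order_centroid(3))
```

For three ranked attributes the weights are 11/18, 5/18 and 2/18, so the top-ranked attribute dominates while the weights still sum to one.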