This paper examines the automatic recognition and extraction of tables from a large collection of heterogeneous documents. The documents are first pre-processed and converted to HTML, after which an algorithm recognises the table portions of each document. A Hidden Markov Model (HMM) is then applied to the HTML code to extract the tables. The model was trained and tested on 526 self-generated tables (321 for training and 205 for testing). The Viterbi algorithm was used for the testing stage. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is effective for table extraction.
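As a rough illustration of the decoding step named in the abstract, the sketch below runs the Viterbi algorithm over a short sequence of HTML tag tokens. The abstract does not give the model's states, observation alphabet or parameters, so the two states ("other", "table"), the tag tokens and every probability here are hypothetical values chosen only for demonstration. They are not the authors' trained model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence, computed in log space."""

    def log(p):
        # Treat zero probability as -infinity so impossible paths never win.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all states, tokens and probabilities are assumptions.
states = ["other", "table"]
start_p = {"other": 0.9, "table": 0.1}
trans_p = {
    "other": {"other": 0.8, "table": 0.2},
    "table": {"other": 0.2, "table": 0.8},
}
emit_p = {
    "other": {"p": 0.7, "table": 0.05, "tr": 0.05, "td": 0.05, "/table": 0.15},
    "table": {"p": 0.05, "table": 0.25, "tr": 0.3, "td": 0.3, "/table": 0.1},
}

tokens = ["p", "table", "tr", "td", "/table", "p"]
path = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, path)))
```

Working in log space avoids numerical underflow on long token sequences, which matters when a whole HTML document is decoded in one pass.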
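The PLSA and LDA topic models mainly address plain-text content. More recently, topic models have been proposed for hypertext; these hypertext models are generative and capture the relationship between words and hyperlinks. Because the word distribution of a hypertext document is determined not only by the document's own topics but also by the topics of the documents it cites, this paper proposes a topic-model-based LPAL (Link PLSA And LDA) model for topic discovery and document classification in hypertext. Like traditional topic models, the model further represents the distribution of words. Experimental results show that the model outperforms the traditional LDA and Link-LDA models in topic discovery and document classification.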