A New HTML Page Cleaning and Compression Algorithm

Cited by: 1


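Abstract: This paper proposes a new HTML page cleaning and compression algorithm for Web information extraction. The algorithm makes full use of the relative position information of the tags in the HTML page tree. Experiments show that it handles syntax errors in pages effectively and compresses redundant page data, giving it good practical value and application prospects.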
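The abstract does not spell out the algorithm's steps, so the following is only a rough sketch of the general idea of cleaning and compressing a page, not the author's method. It assumes a stack of currently open tags as a simple stand-in for tag position information in the page tree. The names `CleaningParser`, `clean_html`, `VOID_TAGS`, and `NOISE_TAGS` are hypothetical, and dropping attributes, comments, and whitespace-only text is a lossy simplification chosen for brevity.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical choices for this sketch; none of them come from the paper.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never have an end tag
NOISE_TAGS = {"script", "style"}                          # content to discard


class CleaningParser(HTMLParser):
    """Rebuilds a page with balanced tags and without noise content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # currently open tags, outermost first
        self.out = []        # pieces of the cleaned page
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        self.out.append(f"<{tag}>")  # attributes are dropped (lossy)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.stack:
            return  # stray end tag with no matching start tag: ignore it
        # Close every tag opened after the matching start tag, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.skip_depth:
            return
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if text.strip():                  # drop whitespace-only text nodes
            self.out.append(escape(text, quote=False))


def clean_html(page: str) -> str:
    """Return a cleaned, compressed version of `page` with balanced tags."""
    parser = CleaningParser()
    parser.feed(page)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = "<div><p>Hello <b>world</p><script>x=1</script><!-- c --></i></div>"
    print(clean_html(dirty))
```

In this sketch, comments and doctype declarations disappear because the corresponding `HTMLParser` handlers are left at their default no-op behavior. A real preprocessing step for information extraction would likely need to keep some attributes (for example `href`) and apply tag-specific nesting rules.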

Author: 任仲晟

Affiliation: School of Mathematics and Computer Science, Fujian Normal University

Source: 《福建电脑》 (Journal of Fujian Computer), 2009, No. 1, pp. 60-61 (2 pages)

Keywords: HTML page cleaning; HTML page compression; preprocessing; information extraction

