Auto-extraction methods of Web pagelet (网页Pagelet的自动抽取方法)

Abstract: Besides the data they are meant to carry, Web pages usually contain a large amount of navigation information, advertisements, and similar material. Based on these characteristics of Web pages, a DOM tree comparison algorithm is proposed. It compares several pages within the same class and identifies the main content of each page. Experimental results show that the method is feasible and effective.
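
The abstract does not spell out the comparison procedure, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that pages of the same class share a template, and it treats any text node that occurs at the same tag path with the same text in every page as template material (navigation, advertisements). The names `PathTextCollector`, `text_nodes`, and `main_content` are hypothetical, and the exact-match test is a deliberate simplification of a DOM similarity measure.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    # Elements that never have a closing tag, so they must not be pushed.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    """Parse one page and return its (tag path, text) pairs."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def main_content(pages):
    """For each page, keep the text nodes that are not shared by all pages.

    Assumption: at least two pages from the same class are supplied.
    """
    per_page = [text_nodes(page) for page in pages]
    shared = set(per_page[0]).intersection(*map(set, per_page[1:]))
    return [[text for (path, text) in items if (path, text) not in shared]
            for items in per_page]
```

Given two pages built from one template, `main_content([page_a, page_b])` returns one list of text fragments per page, with the fragments common to both pages filtered out. A practical implementation would need a tolerant similarity measure rather than exact matching, since template regions often differ slightly between pages.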

Authors: ZHU Ming (朱明), LI Wei (李伟)

Affiliation: Department of Automation, University of Science and Technology of China

Source: Journal of Computer Applications (《计算机应用》), 2005, No. 11, pp. 2612-2614 (3 pages). Indexed in CSCD and the Peking University Core Journals list.

Keywords: Web mining; information retrieval; DOM similarity; DOM node clustering

Classification number: TP393.09 [Automation and Computer Technology: Computer Application Technology]

References (7)

[1] CAVERLEE J, LIU L, BUTTLER D. Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web[A]. Proceedings of the 20th IEEE International Conference on Data Engineering (ICDE04)[C], 2004. 103-114.
[2] NIERMAN A, JAGADISH HV. Evaluating Structural Similarity in XML Documents[A]. Proceedings of the Fifth International Workshop on the Web and Databases[C], 2002. 61-66.
[3] LIAN W, CHEUNG DW, MAMOULIS N, et al. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(1): 82-96.
[4] BERGMAN M. The deep web: Surfacing hidden value[M]. BrightPlanet, 2000.
[5] ZELDMAN J. Designing With Web Standards[M]. New Riders Publishing, 2003.
[6] FLESCA S, MANCO G, MASCIARI E, et al. Fast Detection of XML Structural Similarity[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2): 160-175.
[7] MA L, GOHARIAN N, CHOWDHURY A, et al. Extracting unstructured data from template generated web documents[A]. Proceedings of the twelfth international conference on Information and knowledge management[C], 2003. 512-515.

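[1] ZHU Yihua, ZHANG Chaoqun, ZENG Tong, WU Longfeng, XU Mali, WANG Dongbo, LI Xiaohui. Research on a Web page comment extraction algorithm based on subtree similarity computation (基于子树相似度计算的网页评论提取算法研究)[J]. New Technology of Library and Information Service (现代图书情报技术), 2013(11): 52-59. Cited by: 5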
