Auto-extraction methods of Web pagelet (网页Pagelet的自动抽取方法)
Abstract: In addition to the data of interest, Web pages often contain a large amount of navigation information, advertisements, and similar material. Based on these characteristics of Web pages, a DOM tree comparison algorithm is proposed that compares multiple pages within the same class to identify the main content. Experimental results show that the method is feasible and effective.
Authors: 朱明 (Zhu Ming), 李伟 (Li Wei)
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core list (北大核心), 2005, No. 11, pp. 2612-2614 (3 pages)
Keywords: Web mining; information retrieval; DOM similarity; DOM node clustering
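
The abstract does not spell out the DOM tree comparison algorithm, so the following is only a minimal Python sketch of the general idea of comparing several pages of one class. It assumes that template content (navigation, advertisements) appears with the same tag path and the same text on every page, while the main content differs. The names `PathTextCollector`, `text_nodes`, and `main_content` and the sample pages are hypothetical, and the sketch uses exact matching rather than the DOM similarity and node clustering named in the keywords; it is not the authors' method.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    """Parse one page into its list of (tag path, text) pairs."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def main_content(pages):
    """For each page, keep the text nodes that are NOT shared by all pages."""
    per_page = [text_nodes(page) for page in pages]
    shared = set(per_page[0])
    for items in per_page[1:]:
        shared &= set(items)
    return [[text for path, text in items if (path, text) not in shared]
            for items in per_page]


# Two hypothetical pages generated from the same template.
TEMPLATE = (
    "<html><body>"
    "<div><a>Home</a><a>News</a></div>"
    "<div><p>{body}</p></div>"
    "<div><span>Ad: buy now</span></div>"
    "</body></html>"
)
PAGE_A = TEMPLATE.format(body="Article about DOM trees.")
PAGE_B = TEMPLATE.format(body="Article about Web mining.")

if __name__ == "__main__":
    print(main_content([PAGE_A, PAGE_B]))
```

Exact matching on path and text is the simplest possible comparison; a similarity threshold over subtrees would be needed for pages whose templates vary slightly.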
