
Algorithm for Webpage Semantic Blocks Mining Using Tree Match Method

Cited by: 7
Abstract: On the web, semi-structured documents such as web pages are typically composed of distinct semantic blocks. Locating and mining these blocks plays an important role in understanding page content, analyzing page structure, and improving the browsing experience. However, because web pages differ widely in both structure and content, accurately locating a specific structural region across different pages is a relatively complex task, and traditional matching methods such as regular expressions are not well suited to it. This paper proposes a tree-matching-based method for mining semantic blocks in web pages and optimizes the algorithm with strategies such as pruning. Experiments show that the method effectively improves the F-measure and also substantially reduces computation cost.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2009, No. 8, pp. 1541-1545 (5 pages)
Funding: Supported by the National High Technology Research and Development Program of China (863 Program), grants 2006AA01Z449 and 2008AA01Z408
Keywords: tree edit distance; tree matching; data mining; pruning
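
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one well-known tree-matching technique, Simple Tree Matching (a dynamic program over ordered child lists), combined with a size-ratio pruning rule. The `Node` class, the functions `tree_match` and `similarity`, and the `min_ratio` parameter are all illustrative assumptions and should not be read as the paper's actual algorithm.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A minimal DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def subtree_sizes(root: Node, sizes: Dict[int, int]) -> int:
    """Record id(node) -> node count for every subtree; return root's count."""
    total = 1 + sum(subtree_sizes(c, sizes) for c in root.children)
    sizes[id(root)] = total
    return total


def tree_match(a: Node, b: Node, min_ratio: float = 0.0) -> int:
    """Number of matched node pairs in a top-down, order-preserving mapping.

    min_ratio is a hypothetical pruning knob: a pair of subtrees whose
    sizes differ by more than this ratio is treated as a non-match and is
    not explored. With min_ratio=0.0 nothing is pruned.
    """
    sizes: Dict[int, int] = {}
    subtree_sizes(a, sizes)
    subtree_sizes(b, sizes)

    def match(x: Node, y: Node) -> int:
        if x.tag != y.tag:
            return 0
        sx, sy = sizes[id(x)], sizes[id(y)]
        if min(sx, sy) < min_ratio * max(sx, sy):
            return 0  # pruned: heuristic, trades accuracy for speed
        m, n = len(x.children), len(y.children)
        # dp[i][j]: best score using the first i children of x
        # and the first j children of y
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                pair = match(x.children[i - 1], y.children[j - 1])
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                               dp[i - 1][j - 1] + pair)
        return dp[m][n] + 1  # +1 for the matched roots

    return match(a, b)


def similarity(a: Node, b: Node, min_ratio: float = 0.0) -> float:
    """Matched pairs normalized by the average size of the two trees."""
    total = subtree_sizes(a, {}) + subtree_sizes(b, {})
    return 2 * tree_match(a, b, min_ratio) / total


# Two small page fragments that share a list-like block.
page_a = Node("div", [Node("ul", [Node("li"), Node("li")]), Node("p")])
page_b = Node("div", [Node("ul", [Node("li"), Node("li"), Node("li")]),
                      Node("span")])
print(tree_match(page_a, page_b), similarity(page_a, page_b))
```

In this sketch, a candidate region would be scored against a template subtree with `similarity` and accepted above some threshold. Raising `min_ratio` skips more subtree pairs, which reduces work at the risk of missing matches between blocks of very different sizes.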