Web Standard Based Segmentation of Web Pages (基于Web标准的页面分块算法研究), cited by: 2
Abstract: Web page segmentation plays an important role in document classification, information extraction, topic-focused information collection, and search engine optimization. This paper proposes a page segmentation algorithm based on Web standards: the page is parsed, its layout is analyzed, and the Web standards it follows are used to divide it into blocks. Experiments show that, for pages that conform to Web standards, the algorithm improves segmentation accuracy and adapts better to complex pages.
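Source: Microprocessors (《微处理机》), 2009, No. 6, pp. 58-61 (4 pages)
Funding: Supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 60403009)
Keywords: Web page segmentation; Cascading Style Sheets; semantic block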
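
This record gives only a high-level description of the method (parse the page, analyze its layout, use Web standards to form blocks), so the following is a minimal sketch under stated assumptions and not the paper's algorithm. It assumes that a standards-based page marks its layout regions with `div` elements carrying an `id` or `class`, and it treats each such `div` as one block. The names `DivBlockSegmenter` and `segment` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DivBlockSegmenter(HTMLParser):
    """Assumption: every <div> with an id or class is one layout block."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, block label or None)
        self.blocks = {}   # block label -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        label = None
        if tag == "div":
            attr = dict(attrs)
            label = attr.get("id") or attr.get("class") or None
            if label is not None:
                self.blocks.setdefault(label, [])
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Assign the text to the innermost enclosing labeled block.
        for _, label in reversed(self.stack):
            if label is not None:
                self.blocks[label].append(text)
                break


def segment(html):
    """Return a dict mapping each block label to its joined text."""
    parser = DivBlockSegmenter()
    parser.feed(html)
    parser.close()
    return {label: " ".join(parts) for label, parts in parser.blocks.items()}
```

Calling `segment(page_source)` returns one entry per labeled `div`, with text assigned to the innermost block that contains it. Text outside any labeled `div` is ignored in this sketch; a fuller treatment would also need the layout analysis that the abstract mentions, for example reading the CSS rules that position each block.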
