
Extracting Links for Volumes' Issue and Table of Contents Based on Web Page Segmentation and Link Features (cited by: 1)

Original title: 基于网页分块和链接特征的卷期目录链接提取方法
Abstract: Traditional information extraction methods achieve low precision when extracting the links to a journal's volume and issue tables of contents. To address this problem, this paper proposes a link extraction method based on Web page segmentation and link features. First, taking the layout tags of the page's tag tree as the minimum granularity, an atomic page segmentation algorithm splits the page into several content blocks that are mutually independent and do not contain one another. Second, an atomic content block clustering algorithm uses the subtree structure of the content blocks to merge similar blocks, dividing the page into semantic blocks. Finally, a link block identification algorithm combines link text similarity with a Bayes-based semantic analysis method to identify the region that holds the volume and issue table-of-contents links, from which the links are extracted. Experimental results show that the proposed method can effectively extract these links.
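Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, Peking University Core Journal, 2012, No. 7, pp. 686-693 (8 pages)
Funding: Special research project on rapid sharing of scientific papers in the network era, Science and Technology Development Center of the Ministry of Education (20101333110013, 2011109); Natural Science Foundation of Hebei Province (F2011203219).
Keywords: page segmentation; link blocks; issues' table of contents; link extraction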
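
The abstract names link text similarity as one of the two signals used to identify the table-of-contents link block. The sketch below illustrates only that signal. It is not the paper's algorithm: the record does not give the similarity measure, so Python's `difflib.SequenceMatcher` stands in for it, the Bayes-based semantic analysis is omitted, and the function names and sample link texts are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_link_text_similarity(link_texts):
    """Average pairwise similarity of the link texts in one block (0.0 to 1.0).

    Assumption: SequenceMatcher.ratio() is a stand-in for the paper's
    (unspecified) link text similarity measure.
    """
    pairs = list(combinations(link_texts, 2))
    if not pairs:
        # A block with fewer than two links has no pairs to compare.
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def pick_issue_link_block(blocks):
    """Return the name of the block whose link texts are most alike."""
    return max(blocks, key=lambda name: mean_link_text_similarity(blocks[name]))


# Hypothetical blocks, as they might come out of a page segmentation step.
blocks = {
    "navigation": ["Home", "About the journal", "Submit a manuscript", "Contact"],
    "issues": ["Vol. 31, Issue 5", "Vol. 31, Issue 6", "Vol. 31, Issue 7"],
}
best = pick_issue_link_block(blocks)
print(best)
```

Issue links tend to share one template ("Vol. X, Issue Y"), so their pairwise similarity is high, while navigation links have little text in common. A real system would still need a second signal, such as the semantic analysis the abstract mentions, to tell issue lists apart from other uniform link lists.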

References (9)

  • 1. Cai D, Yu S, Wen J R, et al. VIPS: Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation [C]//Proceedings of the 12th International Conference on World Wide Web, 2003.
  • 2. Yu Manquan, Chen Tierui, Xu Hongbo. Research and design of a block-based Web page information parser [J]. Journal of Computer Applications (计算机应用), 2005, 25(4): 974-976. Cited by: 55
  • 3. Abel O, Li Longzhuang, Liu Yonghuai. Visual Segmentation-Based Data Record Extraction from Web Documents [C]//Proceedings of IEEE International Conference on Information Reuse and Integration, 2007: 502-507.
  • 4. Hou Mingyan, Yang Tianqi. A Web information extraction algorithm based on Web page segmentation [J]. Microcomputer & Its Applications (微型机与应用), 2011, 30(5): 54-56. Cited by: 2
  • 5. Chen Hansheng, Zeng Jianping, Zhang Shiyong. A Web page segmentation method based on position information [J]. Computer Applications and Software (计算机应用与软件), 2009, 26(7): 155-159. Cited by: 3
  • 6. Kovacevic M, Diligenti M, Gori M, et al. Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification [C]//Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002). Maebashi City, Japan, 2002: 250-257.
  • 7. Wang Fang, Yu Hao, Tan Hongye, Zhao Tiejun. A related-link extraction method based on link block segmentation [J]. Computer Engineering and Applications (计算机工程与应用), 2006, 42(31): 110-113. Cited by: 2
  • 8. Bille P. A survey on tree edit distance and related problems [J]. Theoretical Computer Science, 2005, 337(1-3): 217-239.
  • 9. Liu B, Grossman R L, Zhai Y. Mining data records in Web pages [C]//Proc. of the Int'l Conf. on Knowledge Discovery and Data Mining (KDD 2003). Washington: ACM Press, 2003: 601-606.

