Abstract
Traditional information extraction methods achieve low precision when extracting links from issue tables of contents. To address this problem, we propose a method for extracting issue table-of-contents links based on Web page segmentation and link features. First, taking the layout tags of the page tag tree as the minimum granularity, we present an atomic page segmentation algorithm that splits a page into several content blocks that are mutually independent and do not contain one another. Second, we propose an atomic content block clustering algorithm that uses the sub-tree structure of the content blocks to divide the page into semantic blocks by merging blocks with similar structures. Finally, we present a link block identification algorithm that combines link text similarity with Bayes-based semantic analysis to identify the table-of-contents link region, from which the links are extracted. Experimental results show that the proposed method can effectively extract issue table-of-contents links.
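The link text similarity signal mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the authors use, so this example substitutes the ratio from Python's `difflib.SequenceMatcher`. The function names, the 0.8 threshold, and the sample link texts are all hypothetical, and the Bayes-based semantic analysis step is not modeled here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_link_text_similarity(link_texts):
    """Average pairwise similarity of the anchor texts in one block (0.0 to 1.0).

    Assumption: SequenceMatcher.ratio() stands in for the paper's
    unspecified link text similarity measure.
    """
    pairs = list(combinations(link_texts, 2))
    if not pairs:
        # A block with fewer than two links has nothing to compare.
        return 0.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


def looks_like_issue_link_block(link_texts, threshold=0.8):
    """Flag a block whose link texts are highly similar to one another.

    The threshold is a hypothetical value chosen for illustration only.
    """
    if len(link_texts) < 2:
        return False
    return mean_link_text_similarity(link_texts) >= threshold


# Hypothetical anchor texts taken from two blocks of a journal page.
issue_block = ["2012 Vol. 31 No. 1", "2012 Vol. 31 No. 2", "2012 Vol. 31 No. 3"]
nav_block = ["Home", "About the Journal", "Editorial Board", "Contact Us"]

print(looks_like_issue_link_block(issue_block))
print(looks_like_issue_link_block(nav_block))
```

Issue links tend to share a fixed pattern and differ only in a volume or issue number, so their pairwise similarity is high, while navigation links have little text in common. In the described method this signal is combined with a semantic classifier rather than used alone.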
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2012, No. 7, pp. 686-693 (8 pages)
Journal of the China Society for Scientific and Technical Information
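Funding
Special Research Project on Rapid Sharing of Scientific Papers in the Network Era, Science and Technology Development Center of the Ministry of Education (20101333110013, 2011109)
Natural Science Foundation of Hebei Province (F2011203219)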
Keywords
page segmentation
link blocks
issue table of contents
link extraction