Abstract
This paper studies algorithms for extracting the effective information from "content-dominated" Web pages. Such leaf-level pages carry the topical content that a Web site is meant to convey: they typically contain one long block of main text, preceded and followed by formatting information (e.g., navigation links, interactive elements, and JavaScript). The paper analyzes the structural characteristics of this kind of page and recasts the task as follows: given the HTML source file of a content-dominated page, find the best interval for the main text. On this basis it proposes a main-text extraction algorithm based on the fast Fourier transform (FFT). Using window segmentation, statistical principles, and the FFT, the method computes a weight for every candidate interval and selects the best one as the solution. Experimental results show that the algorithm extracts the effective information of content-dominated Web pages fairly accurately.
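The abstract names the ingredients (window segmentation, per-interval weights, the FFT) but not how they fit together, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that each fixed-size window is scored by its share of non-markup characters, that the FFT is used to smooth those scores by convolution, and that the "best interval" is the contiguous run of windows maximizing the sum of (smoothed score minus a threshold). The function names, the window size of 200, the smoothing kernel, and the threshold of 0.5 are all hypothetical.

```python
import cmath
import re


def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def fft_convolve(x, h):
    """Linear convolution of two real sequences, computed via the FFT."""
    size = 1
    while size < len(x) + len(h) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fh = fft([complex(v) for v in h] + [0j] * (size - len(h)))
    y = fft([p * q for p, q in zip(fx, fh)], invert=True)
    return [v.real / size for v in y[: len(x) + len(h) - 1]]


def window_text_density(html, window=200):
    """Share of characters outside tags, scripts, and styles in each window."""
    mask = [1] * len(html)  # 1 = text character, 0 = markup character
    pattern = r"(?is)<script.*?</script>|<style.*?</style>|<[^>]*>"
    for m in re.finditer(pattern, html):
        for i in range(m.start(), m.end()):
            mask[i] = 0
    chunks = [mask[i:i + window] for i in range(0, len(html), window)]
    return [sum(c) / len(c) for c in chunks]


def best_content_range(html, window=200, kernel=(0.25, 0.5, 0.25), threshold=0.5):
    """Return (start, end) character offsets of the assumed main-text interval."""
    density = window_text_density(html, window)
    # Assumption: the FFT serves to smooth the window scores by convolution.
    offset = len(kernel) // 2
    smooth = fft_convolve(density, kernel)[offset: offset + len(density)]
    # Maximum-sum contiguous interval (Kadane's scan) over (score - threshold).
    best, best_range = float("-inf"), (0, 0)
    current, start = 0.0, 0
    for i, value in enumerate(smooth):
        if current <= 0:
            current, start = 0.0, i
        current += value - threshold
        if current > best:
            best, best_range = current, (start, i)
    return best_range[0] * window, min(len(html), (best_range[1] + 1) * window)
```

Calling `best_content_range(html)` on a page whose long text block sits between navigation markup and trailing scripts should return offsets that bracket the text block to within one window. The scan over intervals is linear in the number of windows, and the smoothing step costs O(n log n) because of the FFT.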
Source
《计算机工程与应用》
CSCD
PKU Core Journal
2007, No. 30, pp. 148-151 (4 pages)
Computer Engineering and Applications
Funding
Supported by the CNGI Demonstration Project of the National Development and Reform Commission (No. CNGI-04-15-2A)
Keywords
Chinese information processing
Web page
information extraction
Web page structure
Fast Fourier Transform (FFT)
page segmentation