Abstract
For the large number of template-based Web pages on the Internet, an algorithm for Web-page segmentation and automatic topic-information extraction is proposed that exploits their semi-structured nature. The algorithm segments a page into blocks using its HTML tags, represents each block as a feature vector through an improved version of the traditional text feature selection method, and identifies the topic content blocks according to an ordered tag set. Applying the algorithm to the preprocessing stage of Web page classification improves both the speed and the accuracy of classification. Experiments show that classifying Web pages after topic-information extraction improves the recall and precision of the classification system.
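The pipeline summarized in the abstract (split the page at HTML tags, turn each block into a feature vector, pick the topic block) can be illustrated with a rough Python sketch. This is not the paper's algorithm: the record does not give the ordered tag set or the improved feature selection method, so the `BLOCK_TAGS` set, the plain term-frequency vectors, and the rule of choosing the block most similar to the page title are all assumptions made for illustration, and the names `BlockSplitter`, `vectorize`, `cosine`, and `topic_block` are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumed block-level tags; the paper's ordered tag set is not given in this record.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "li", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Collects the <title> text and splits the remaining text at block-level tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []
        self._buf = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Normalize whitespace and store the buffered text as one block.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "title":
            self._in_title = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def vectorize(text):
    """Term-frequency vector; a stand-in for the paper's feature selection."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_block(html):
    """Return the block most similar to the page title (an assumed heuristic)."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    if not splitter.blocks:
        return ""
    title_vec = vectorize(splitter.title)
    return max(splitter.blocks, key=lambda b: cosine(vectorize(b), title_vec))
```

Calling `topic_block(html)` on a page whose navigation bar and footer share no words with the title returns the body block instead. A classifier would then be trained on that block alone rather than on the full page text.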
Source
《华中科技大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journals
2007, No. 10, pp. 39-41 (3 pages)
Journal of Huazhong University of Science and Technology(Natural Science Edition)
Keywords
Web-page segmentation
topic content information
automatic extraction
feature selection
Web page classification