Abstract
As web pages carry more and more advertising, accurately extracting their main content becomes increasingly difficult. To address this problem, this paper proposes a content extraction algorithm for HTML pages based on weight optimization. The algorithm first analyzes the characteristics of main content in web pages to determine the feature attributes of content blocks and to derive the statistical features of those attributes. Then, because the feature attributes differ in importance, particle swarm optimization is used to optimize the feature weights and the threshold, which further improves performance. Finally, experiments are performed to verify the method. The results show that, compared with the extraction algorithm without weight optimization, the proposed method raises the recall of content extraction to 95.8% while keeping precision essentially unchanged.
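The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each page block is described by three hypothetical features normalized to [0, 1], that a block counts as main content when its weighted feature sum reaches a threshold, and that the optimization target is the F1 score on a small labeled set. The feature names, the toy data, the F1 objective, and the particle swarm settings (inertia 0.7, acceleration coefficients 1.5) are all assumptions made for illustration; none of them comes from the paper.

```python
import random

# Toy labeled blocks (hypothetical features, not taken from the paper):
# (text density, 1 - link density, punctuation density), each in [0, 1].
BLOCKS = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.6), (0.7, 0.7, 0.9),
          (0.2, 0.1, 0.1), (0.3, 0.2, 0.0), (0.1, 0.3, 0.2)]
LABELS = [True, True, True, False, False, False]  # True = main content


def classify(features, params):
    """Treat a block as main content when its weighted score reaches the threshold."""
    *weights, threshold = params  # last parameter is the threshold
    score = sum(w * x for w, x in zip(weights, features))
    return score >= threshold


def f1_score(params, blocks, labels):
    """F1 of the weighted-threshold classifier on labeled blocks (assumed objective)."""
    tp = fp = fn = 0
    for features, is_content in zip(blocks, labels):
        predicted = classify(features, params)
        if predicted and is_content:
            tp += 1
        elif predicted:
            fp += 1
        elif is_content:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pso(fitness, dim, n_particles=20, iterations=50,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm optimization that maximizes fitness over [0, 1]^dim."""
    rng = random.Random(seed)
    positions = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_positions = [p[:] for p in positions]
    best_scores = [fitness(p) for p in positions]
    g = max(range(n_particles), key=best_scores.__getitem__)
    global_best, global_score = best_positions[g][:], best_scores[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c1 * r1 * (best_positions[i][d] - positions[i][d])
                    + c2 * r2 * (global_best[d] - positions[i][d])
                )
                # Keep every weight and the threshold inside [0, 1].
                positions[i][d] = min(1.0, max(0.0, positions[i][d] + velocities[i][d]))
            score = fitness(positions[i])
            if score > best_scores[i]:
                best_positions[i], best_scores[i] = positions[i][:], score
                if score > global_score:
                    global_best, global_score = positions[i][:], score
    return global_best, global_score


def block_fitness(params):
    return f1_score(params, BLOCKS, LABELS)


# Three feature weights plus one threshold give a 4-dimensional search space.
best_params, best_score = pso(block_fitness, dim=4)
print("weights:", best_params[:3], "threshold:", best_params[3], "F1:", best_score)
```

Optimizing the threshold together with the weights, as the abstract describes, lets the swarm trade precision against recall in a single search. A real evaluation would score the tuned parameters on held-out pages rather than on the blocks used for tuning.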
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2011, No. 4, pp. 32-37 (6 pages)
Journal of South China University of Technology (Natural Science Edition)
Funding
National Basic Research Program of China ("973" Program) (2007CB311106)
Keywords
weight optimization
content extraction
feature attribute
statistical feature
precision
recall rate