Abstract
This paper proposes a web page denoising method in which templates and a support vector machine (SVM) work together. The method divides web page noise into two classes: common noise and personalized noise. A template library is first built from a collection of web pages, and the templates are used to remove the common noise. For the remaining personalized noise, features are computed for block-level tags and used to train an SVM model; the trained model then classifies each block-level tag as either noise or main text, which completes the denoising. The method effectively removes copyright notices, navigation, advertisements, and other noise from topic-type web pages. Compared with denoising by SVM alone, it improves both precision and recall.
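The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: blocks are assumed to be already extracted from the HTML as (text, link_chars) pairs, the template is approximated by exact block texts that recur across most pages, the two features (capped text length and link density) are placeholders for whatever block-level tag features the authors compute, and the classifier is a small hand-written linear SVM rather than the authors' model. The function names and sample data are hypothetical.

```python
from collections import Counter

# A block is (text, link_chars): the visible text of one block-level tag and
# the number of its characters that sit inside links. Extraction from HTML is
# assumed to have happened already (assumption, not from the paper).

def build_template(pages, min_share=0.8):
    """Return block texts that recur on at least min_share of the pages."""
    counts = Counter()
    for blocks in pages:
        counts.update({text for text, _ in blocks})  # count once per page
    return {t for t, n in counts.items() if n / len(pages) >= min_share}

def features(block):
    """Two illustrative features: capped text length and link density."""
    text, link_chars = block
    length = len(text)
    link_density = link_chars / length if length else 1.0
    return [min(length, 100) / 100, link_density]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - lr * lam) for wj in w]  # L2 shrinkage
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def denoise(blocks, template, w, b):
    """Drop template blocks, then keep blocks the SVM scores as main text."""
    kept = []
    for block in blocks:
        if block[0] in template:
            continue  # common noise, removed by the template
        score = sum(wj * xj for wj, xj in zip(w, features(block))) + b
        if score > 0:  # positive side of the hyperplane = main text
            kept.append(block[0])
    return kept

# Hypothetical page collection from one site: shared navigation and footer.
NAV = ("Home | News | About", 13)
FOOT = ("Copyright 2020 Example Inc.", 0)
MAIN3 = "Third article body with a short link inside an otherwise plain paragraph."
pages = [
    [NAV, ("First article body, long enough to read as running prose.", 0), FOOT],
    [NAV, ("Second article body, again a full sentence of ordinary text.", 0), FOOT],
    [NAV, (MAIN3, 10), ("Buy cheap tickets now", 21), FOOT],
]

# Hypothetical labelled blocks: +1 = main text, -1 = personalized noise.
labelled = [
    (("The proposed method removes shared template blocks before any classifier is applied.", 0), 1),
    (("Remaining blocks are described by simple features such as length and link density.", 12), 1),
    (("Click here for great deals", 26), -1),
    (("Sports Finance Travel", 19), -1),
]

template = build_template(pages)
w, b = train_linear_svm([features(blk) for blk, _ in labelled],
                        [label for _, label in labelled])
result = denoise(pages[2], template, w, b)
print(result)
```

In this sketch the footer is short and link-free, so a classifier using only these two features could plausibly mistake it for text; removing it by template first is one way to read the abstract's motivation for combining the two stages.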
Source
Computer Science and Application (《计算机科学与应用》)
2020, No. 1, pp. 51-59 (9 pages)
Funding
Supported by the Chengdu Science and Technology Program (2019-RK00-00015-ZF).