
Simulation of Web Repetitive Information Extraction Based on Pattern Recognition Algorithm

Cited by: 2
Abstract: Existing methods for extracting duplicate information from web pages lack an information classification step, which leads to a low extraction completeness rate, a high proportion of duplicate information, and poor overall performance. To address this, a web page duplicate information extraction method based on a pattern recognition algorithm is proposed. An inter-class balance factor and word frequency are used to obtain the mutual information features of web page information. On the basis of association rules, the mutual information features are vectorized according to web page confidence, which completes the extraction of web page information features. A support vector machine, a pattern recognition technique, is then used to classify the web page information: the penalty function is optimized and a soft-margin support vector machine classifier is built. The structural similarity and semantic similarity of the different categories of web page information are computed and combined to obtain the similarity of web page information, which completes the extraction of duplicate web page information. Simulation results show that the proposed method achieves a high extraction completeness rate, a low proportion of duplicate information, and good overall application performance.
Authors: LI Yu-qi (李玉琦); LI Long (李龙) (Beijing University of Posts and Telecommunications, Beijing 100876, China; University of Science and Technology of China, Hefei, Anhui 230026, China)
Source: Computer Simulation (《计算机仿真》), Peking University Core Journal, 2022, No. 3, pp. 439-443 (5 pages)
Keywords: pattern recognition algorithm; web page repetitive information; feature extraction; support vector machine; information extraction
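
The final step described in the abstract, combining a structural similarity and a semantic similarity into one score and using it to pick out duplicate information, can be sketched roughly as follows. The abstract does not state which similarity measures or which combination rule the authors use, so everything here is an assumption made for illustration: Jaccard overlap of tag paths stands in for structural similarity, cosine similarity of term-frequency vectors stands in for semantic similarity, and the weight `alpha`, the cutoff `threshold`, and the `paths`/`tokens` block layout are hypothetical. The feature extraction and soft-margin support vector machine classification steps are not shown.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Assumed structural similarity: overlap of two sets of tag paths."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(tokens_a, tokens_b):
    """Assumed semantic similarity: cosine of term-frequency vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_similarity(block_a, block_b, alpha=0.5):
    """Weighted mix of both scores; alpha is a hypothetical weight."""
    s_struct = jaccard(block_a["paths"], block_b["paths"])
    s_sem = cosine(block_a["tokens"], block_b["tokens"])
    return alpha * s_struct + (1 - alpha) * s_sem


def find_duplicates(blocks, threshold=0.8, alpha=0.5):
    """Return index pairs whose combined similarity reaches the threshold."""
    pairs = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if combined_similarity(blocks[i], blocks[j], alpha) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy input: the first two blocks are identical, the third is unrelated.
blocks = [
    {"paths": ["html/body/div/p"], "tokens": "cheap flights to paris".split()},
    {"paths": ["html/body/div/p"], "tokens": "cheap flights to paris".split()},
    {"paths": ["html/body/table/tr"], "tokens": "weather forecast for monday".split()},
]
print(find_duplicates(blocks))  # [(0, 1)]
```

In a pipeline closer to the one the abstract outlines, the pairwise comparison would presumably run on information that has already been classified, which would cut down the number of pairs to compare; that refinement is left out of this sketch.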
