Abstract
To keep the result of WEB-to-WAP conversion complete and concise, this paper addresses the removal of useless information during conversion and proposes a page de-noising solution. First, an algorithm uses each node's size and position to judge whether the node holds core content. On that basis, the node's link ratio is calculated and compared with a threshold to further determine the node type. To avoid deleting content by mistake, regular expressions are applied to suspected noise blocks to check where the links in the node point; if the great majority point to other websites, the node is judged to be a noise node. Finally, an experimental platform is built to evaluate the solution, and the evaluation demonstrates that it is effective and reliable.
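The link-ratio and link-target checks described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Node` type, the definition of link ratio as the share of visible characters inside links, both threshold values, and the host-matching pattern are all assumptions, since the abstract gives none of these details. The size-and-position step is omitted.

```python
import re
from dataclasses import dataclass, field

# Hypothetical thresholds; the abstract does not state the values used.
LINK_RATIO_THRESHOLD = 0.5
EXTERNAL_SHARE_THRESHOLD = 0.8

# Captures the host part of an absolute http(s) URL (assumed pattern).
HOST_RE = re.compile(r"^https?://([^/:?#]+)", re.IGNORECASE)


@dataclass
class Node:
    """Hypothetical page block: its visible text plus the links it contains."""
    text: str                                   # visible text, anchor text included
    links: list = field(default_factory=list)   # (anchor_text, href) pairs


def link_ratio(node):
    """Share of the node's visible characters that sit inside links (assumed definition)."""
    total = len(node.text)
    if total == 0:
        return 0.0
    linked = sum(len(anchor) for anchor, _ in node.links)
    return min(linked / total, 1.0)


def is_external(href, site_host):
    """True if href is an absolute URL whose host is neither site_host nor a subdomain of it."""
    match = HOST_RE.match(href)
    if match is None:
        return False  # relative links are treated as same-site
    host = match.group(1).lower()
    site = site_host.lower()
    return host != site and not host.endswith("." + site)


def is_noise(node, site_host):
    """Flag a node as noise only if it is link-heavy and most of its links leave the site."""
    if not node.links or link_ratio(node) <= LINK_RATIO_THRESHOLD:
        return False
    external = sum(is_external(href, site_host) for _, href in node.links)
    return external / len(node.links) >= EXTERNAL_SHARE_THRESHOLD


# Example nodes on a hypothetical site, news.example.com.
ad = Node("Buy now Cheap deals",
          [("Buy now", "http://ads.example.net/a"),
           ("Cheap deals", "http://shop.example.org/b")])
nav = Node("Home Sports",
           [("Home", "/"), ("Sports", "http://news.example.com/sports")])
```

Requiring both conditions reflects the abstract's concern about wrong deletion: a navigation bar is link-heavy but points to the same site, so in this sketch only `ad` would be flagged.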
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journal (北大核心)
2012, Issue 4, pp. 178-179, 199 (3 pages in total)
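Funding
Natural Science Foundation of Inner Mongolia Autonomous Region (2010MS0913)
Inner Mongolia University of Technology Scientific Research Project (ZS201004)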
Keywords
Mobile internet
Web page de-noising
Advertisement removal
Web page structure
Link ratio
Regular expression