A Method Based on Node Density Segmentation and Label Propagation for Mining Web Page

Cited by: 13

Abstract: Extracting the important content of a Web page, such as text and links, is valuable in many areas of Web mining research. Current solutions mainly rely on Web page segmentation and block identification. However, existing methods treat the identification of important text and of important links as two independent problems, which ignores the intrinsic semantic relationships between the text and links in a page and also lowers processing efficiency. This paper proposes a unified framework for mining important content in Web pages. The framework has three components. First, a Web page is converted into a DOM tree, which is then segmented into page blocks using node density entropy as the metric. Second, a semi-supervised method based on K-nearest-neighbor label propagation automatically expands the training set of page blocks. Third, an SVM classifier is trained on the expanded training set and then used to classify page blocks. The framework can distinguish multiple types of Web page blocks and is independent of the type and layout of Web pages. Extensive experiments in a real Web environment show that the method is effective.
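The second component, expanding a small labeled set of page blocks by K-nearest-neighbor label propagation, can be illustrated with a minimal sketch. The abstract does not give the paper's block features or its exact propagation rule, so everything here is an assumption: each block is reduced to a hypothetical two-value vector (text density, link density), unlabeled blocks take the majority label of their k nearest neighbors, and the seed labels stay fixed. The function name `knn_label_propagation` and the sample data are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_label_propagation(features, seed_labels, k=2, max_iter=20):
    """Spread labels from a few seed blocks to unlabeled blocks.

    features    -- list of feature vectors, one per page block (assumed features)
    seed_labels -- dict mapping block index -> known label
    Returns a dict mapping block index -> label for every block that
    received a label.
    """
    n = len(features)
    # Precompute the k nearest neighbors of every block (excluding itself).
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(features[i], features[j]),
        )
        neighbours.append(others[:k])

    labels = dict(seed_labels)
    for _ in range(max_iter):
        updates = {}
        for i in range(n):
            if i in seed_labels:
                continue  # seed labels are clamped and never overwritten
            votes = Counter(labels[j] for j in neighbours[i] if j in labels)
            if not votes:
                continue  # no labeled neighbor yet; retry in a later pass
            # Majority vote; ties are broken alphabetically so the result
            # is deterministic.
            best = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            if labels.get(i) != best:
                updates[i] = best
        if not updates:
            break  # converged: no label changed in this pass
        labels.update(updates)
    return labels


# Hypothetical blocks: (text density, link density).
features = [
    (0.90, 0.10),  # block 0: labeled seed, text block
    (0.85, 0.15),  # block 1: unlabeled
    (0.80, 0.05),  # block 2: unlabeled
    (0.10, 0.90),  # block 3: labeled seed, link block
    (0.15, 0.85),  # block 4: unlabeled
    (0.05, 0.95),  # block 5: unlabeled
]
seeds = {0: "text", 3: "link"}
expanded = knn_label_propagation(features, seeds, k=2)
```

In the framework described above, the expanded label set would then serve as training data for the SVM classifier; that step is omitted here.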

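Authors: Zhang Naizhou, Cao Wei, Li Shijun

Affiliations: School of Computer and Information Engineering, Henan University of Economics and Law; School of Computer Science, Wuhan University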

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2015, Issue 2, pp. 349-364 (16 pages)

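Funding: Supported by the National Natural Science Foundation of China (61272109, 61202285); the National Spark Program (2012GA750007); the Basic and Frontier Technology Research Project of the Henan Provincial Department of Science and Technology (122300410378); and the Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520032).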

Keywords: Web page segmentation; node density; label propagation; DOM tree; block classification; social computing; social networks

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
