Web Information Extraction Algorithm Based on Web Page Segmentation

Original title: 基于网页分割的Web信息提取算法

Abstract: To address the high complexity of extracting unstructured information from Web pages, this paper proposes a Web information extraction algorithm based on Web page segmentation. The method first preprocesses the page to remove noise, then clusters tag paths according to the page's Document Object Model (DOM) tree structure. An automatically trained threshold and a Web page segmentation algorithm quickly identify the key part of the page, and a text extraction template is derived from the nested structure within the data blocks. Experiments on different types of Web sites show that the algorithm runs fast and achieves high accuracy.
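The abstract only names the pipeline stages, so the following is a minimal illustrative sketch of the general idea (noise filtering, grouping text by DOM tag path, threshold-based selection of the key block), not the authors' algorithm. The names `TagPathCollector` and `key_blocks`, the noise-tag set, and the fixed `threshold=0.3` are all assumptions made for illustration. The paper trains its threshold automatically, and exact-path grouping stands in here for its clustering step.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed noise tags whose text is discarded (a stand-in for the paper's preprocessing).
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements never get a closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Groups text nodes by their tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, root first
        self.groups = defaultdict(list)  # tag path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not NOISE_TAGS.intersection(self.stack):
            self.groups["/".join(self.stack)].append(text)


def key_blocks(html, threshold=0.3):
    """Return tag-path groups holding at least `threshold` of all page text.

    The fixed threshold is a hypothetical value chosen for this sketch.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    sizes = {p: sum(len(t) for t in texts) for p, texts in collector.groups.items()}
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {p: collector.groups[p] for p, n in sizes.items() if n / total >= threshold}
```

Calling `key_blocks(page_html)` on a page whose article paragraphs share one tag path returns that path with its text fragments, while short navigation links under a different path fall below the threshold and are dropped.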

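Authors: Hou Mingyan (侯明燕), Yang Tianqi (杨天奇)

Affiliation: Department of Computer Science, Jinan University (暨南大学)

Source: Microcomputer & Its Applications (《微型机与应用》), 2011, No. 5, pp. 54-56 (3 pages)

Funding: Guangdong Province Soft Science Research Project (2009B070300052)

Keywords: Web page segmentation; information extraction; clustering; threshold

Classification: TP311.5 [Automation and Computer Technology: Computer Software and Theory]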
