Web Information Extraction Algorithm Based on Web Page Segmentation
Cited by: 2
Abstract: To address the high complexity of extracting unstructured information from Web pages, this paper proposes a Web information extraction algorithm based on Web page segmentation. The method first preprocesses the page to remove noise, then clusters tag paths according to the page's Document Object Model (DOM) tree structure. An automatically trained threshold and a Web page segmentation algorithm quickly identify the key parts of the page, and a text extraction template is derived from the nested structure within the data blocks. Experiments on different types of Web sites show that the algorithm runs fast and achieves high accuracy.
Source: Microcomputer & Its Applications (《微型机与应用》), 2011, No. 5, pp. 54-56 (3 pages)
Funding: Guangdong Province Soft Science Research Project (2009B070300052)
Keywords: Web page segmentation; information extraction; clustering; threshold
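
The abstract names the pipeline stages (noise preprocessing, tag path clustering over the DOM tree, a threshold that picks out the key part of the page) but this record does not include the algorithm itself. The following is only a minimal sketch of the general idea, not the authors' method. The names `TagPathCollector`, `key_paths`, `NOISE_TAGS`, and `VOID_TAGS` are hypothetical, the noise set is an assumption, and the threshold is a fixed illustrative value, whereas the paper describes training it automatically.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed noise set for illustration; the paper's preprocessing is not specified here.
NOISE_TAGS = {"script", "style", "noscript"}
# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Groups text nodes by their tag path (e.g. 'html/body/div/p')."""

    def __init__(self):
        super().__init__()
        self.stack = []                   # currently open tags, root first
        self.groups = defaultdict(list)   # tag path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace and anything nested inside a noise tag.
        if text and not NOISE_TAGS.intersection(self.stack):
            self.groups["/".join(self.stack)].append(text)


def key_paths(html, threshold=0.3):
    """Return the tag-path clusters holding at least `threshold` of all text.

    `threshold` is a hand-picked illustrative value, not a trained one.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    sizes = {path: sum(len(t) for t in texts)
             for path, texts in collector.groups.items()}
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {path: collector.groups[path]
            for path, size in sizes.items()
            if size / total >= threshold}
```

Calling `key_paths(page_source)` returns a dictionary from tag path to the text fragments found under it, keeping only the clusters whose share of the page text meets the threshold. Grouping by identical tag path is the simplest possible clustering; a fuller treatment would also have to handle near-identical paths and the nested structure inside each data block.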