A Method of Web Information Extraction Based on Classification Algorithm (cited by: 11)
Abstract: Many current Web information extraction techniques rely on HTML structure. Because the structure of HTML pages changes frequently, the extraction templates (wrappers) must be updated frequently as well, and updating a wrapper requires considerable domain knowledge. This paper proposes a Web information extraction method based on a classification algorithm. The method groups the text of a Web page by its display attributes, classifies the text according to the values of those attributes, and retrieves the text of interest, thereby completing the extraction. The method is simple to operate, easy to implement, and depends little on page structure. Experiments show that it performs well.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 3, pp. 91-93 (3 pages)
Funding: National 242 Fund (project nos. 2005B22, 2006B20)
Keywords: Web information extraction, attribute vector, wrapper, display attributes
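
The abstract describes grouping page text by its display attributes and then selecting the group that holds the text of interest. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions: the attribute vector is limited to a hypothetical set (`color`, `size`, `face`, plus a bold flag), only legacy presentational HTML attributes are read (no CSS), and exact matching of attribute vectors stands in for whatever classifier the paper actually uses. The names `DisplayAttributeParser`, `group_by_display`, and `extract_like` are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumed display attributes; the paper's real attribute set is not given here.
DISPLAY_ATTRS = ("color", "size", "face")
BOLD_TAGS = {"b", "strong"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class DisplayAttributeParser(HTMLParser):
    """Collects (attribute_vector, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        root_style = {"color": "", "size": "", "face": "", "bold": False}
        self.stack = [("", root_style)]  # (tag, inherited display attributes)
        self.samples = []                # list of (attribute_vector, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(self.stack[-1][1])  # inherit from the enclosing element
        for name, value in attrs:
            if name in DISPLAY_ATTRS and value:
                style[name] = value.lower()
        if tag in BOLD_TAGS:
            style["bold"] = True
        self.stack.append((tag, style))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            style = self.stack[-1][1]
            vector = (style["color"], style["size"], style["face"], style["bold"])
            self.samples.append((vector, text))


def group_by_display(html):
    """Group page text by display-attribute vector, ignoring tag nesting."""
    parser = DisplayAttributeParser()
    parser.feed(html)
    parser.close()
    groups = defaultdict(list)
    for vector, text in parser.samples:
        groups[vector].append(text)
    return dict(groups)


def extract_like(html, example_text):
    """Return all text displayed the same way as one labelled example."""
    for texts in group_by_display(html).values():
        if example_text in texts:
            return texts
    return []


# Hypothetical page: the two headlines sit in different structures (div vs. table)
# but share the same display attributes.
PAGE = """
<html><body>
<font color="gray" size="1">Home | Login</font>
<div><font color="#000000" size="3"><b>First headline</b></font></div>
<table><tr><td><font color="#000000" size="3"><b>Second headline</b></font></td></tr></table>
<font color="gray" size="1">Copyright notice</font>
</body></html>
"""

if __name__ == "__main__":
    print(extract_like(PAGE, "First headline"))
```

Because the grouping key is the display-attribute vector rather than the tag path, moving a headline from a `div` into a `table` does not change which group it falls into. That is the structural independence the abstract claims, shown here only for this toy page.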

