
Data Extracting Based on the Page Comparison and Analysis (基于页面对比分析的数据提取)
Cited by: 1
Abstract: Web-based data services are expanding quickly with the rapid growth of the Internet. For Web pages that offer large-scale data queries, this paper proposes a Web data extraction method based on comparing and analyzing pages within the same site. The method first parses each semi-structured HTML document into a tree and partitions it into blocks, then compares pages to identify the blocks that carry data. It next extracts the data from those blocks by comparing multiple pages that share the same structure and by checking the format (table structure) of the content. Finally, the extracted data is stored in a database. The method has been applied in several information extraction systems, where it achieved efficient and accurate data extraction.
Authors: 张聚弘, 山岚
Source: Computer & Digital Engineering (《计算机与数字工程》), 2006, No. 1, pp. 49-52 (4 pages)
Keywords: data extraction; Web page structure; semi-structured
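
The pipeline summarized in the abstract (build a tree, split the page into blocks, compare pages with the same structure, store the result) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which the abstract does not describe in detail. It assumes that two pages from the same site share one template, so their text blocks line up one to one, and that any block whose text differs between the two pages is data. The names `BlockCollector`, `parse_blocks`, `extract_data`, `store`, `PAGE_A`, and `PAGE_B`, as well as the sample pages, are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect a (tag path, text) pair for every non-empty text node."""

    # Tags that never have a closing tag, so they are not pushed on the stack.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def parse_blocks(html):
    """Parse one page into its ordered list of (tag path, text) blocks."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def extract_data(page_a, page_b):
    """Return blocks of page_a that sit at the same position and tag path
    as in page_b but carry different text (assumed to be data)."""
    pairs = zip(parse_blocks(page_a), parse_blocks(page_b))
    return [
        (path_a, text_a)
        for (path_a, text_a), (path_b, text_b) in pairs
        if path_a == path_b and text_a != text_b
    ]


def store(rows, conn):
    """Save extracted (path, value) rows into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS extracted (path TEXT, value TEXT)")
    conn.executemany("INSERT INTO extracted VALUES (?, ?)", rows)
    conn.commit()


# Two hypothetical result pages generated from the same template.
PAGE_A = (
    "<html><body><h1>Product query</h1><table>"
    "<tr><td>Name</td><td>Widget</td></tr>"
    "<tr><td>Price</td><td>9.50</td></tr>"
    "</table></body></html>"
)
PAGE_B = (
    "<html><body><h1>Product query</h1><table>"
    "<tr><td>Name</td><td>Gadget</td></tr>"
    "<tr><td>Price</td><td>12.00</td></tr>"
    "</table></body></html>"
)
```

Calling `extract_data(PAGE_A, PAGE_B)` keeps only the cells whose text changes between the two pages ("Widget" and "9.50"), while the shared heading and the "Name" and "Price" labels are treated as template text. The rows can then be passed to `store` together with a connection such as `sqlite3.connect(":memory:")`. Real pages rarely align this cleanly, so a practical system would need a more tolerant tree alignment than the positional pairing assumed here.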
