
DOM-Based Automatic Extraction of Topical Information from Web Pages (基于DOM的网页主题信息自动提取)

Cited by: 81
Abstract: The main information on a web page is usually buried among large amounts of irrelevant structure and text, such as unnecessary images and extraneous links. This prevents users from quickly locating the topical information and limits the usability of the Web; information extraction helps address the problem. Building on the DOM specification, and targeting HTML's semi-structured nature and lack of semantic description, the paper proposes the STU-DOM tree model, which can be viewed as a DOM tree augmented with semantic contextual attributes. An HTML document is converted into an STU-DOM tree, which is then filtered by structure and pruned by semantics so that the topical information can be extracted automatically and accurately. The method does not depend on the information source (document structure or domain) and, unlike most approaches, leaves the structure and content of the source page unchanged. The authors describe it as automatic, reliable, and general, with applications in web browsing on handheld devices such as PDAs and mobile phones, and in retrieval systems.
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2004, No. 10, pp. 1786-1792 (7 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
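Funding: National Key Basic Research Development Program of China ("973" Program), grant G1999032705; National High-Tech Research and Development Program of China ("863" Program), major special project on database management systems and their applications, grant 2002AA4Z3440.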
Keywords: DOM; information extraction; partition; STU; STU tree; STU-DOM tree; correlativity
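
The pipeline outlined in the abstract (build a tree from the HTML, annotate it, filter by structure, then prune by relevance) can be illustrated with a minimal Python sketch. This is not the paper's STU-DOM algorithm, whose details the abstract does not give. As stand-ins, the sketch assumes a hypothetical list of noise tags for the structural filter and a hypothetical link-text ratio threshold in place of the semantic relevance measure. The names `NOISE_TAGS`, `BLOCK_TAGS`, `MAX_LINK_RATIO`, and `extract_topic_text` are all illustrative. It works on an in-memory tree, so the source HTML is left untouched.

```python
from html.parser import HTMLParser

# Hypothetical settings for this sketch; the abstract does not specify them.
NOISE_TAGS = {"script", "style", "noscript", "iframe", "form"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "tr", "td", "nav", "p"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
MAX_LINK_RATIO = 0.5  # blocks with more link text than this are pruned


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag          # element name, or "#text" for text nodes
        self.parent = parent
        self.text = text        # only set on "#text" nodes
        self.children = []
        self.text_len = 0       # annotation: characters of text in subtree
        self.link_len = 0       # annotation: characters of text inside <a>


class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.current.children.append(Node("#text", self.current, text))


def filter_structure(node):
    """Structure-based filtering: drop subtrees rooted at noise tags."""
    node.children = [c for c in node.children if c.tag not in NOISE_TAGS]
    for child in node.children:
        filter_structure(child)


def annotate(node, in_link=False):
    """Attach text and link-text lengths to every node."""
    in_link = in_link or node.tag == "a"
    node.text_len = len(node.text)
    node.link_len = len(node.text) if in_link else 0
    for child in node.children:
        annotate(child, in_link)
        node.text_len += child.text_len
        node.link_len += child.link_len


def prune(node):
    """Pruning: drop block elements that consist mostly of link text."""
    kept = []
    for child in node.children:
        if (child.tag in BLOCK_TAGS and child.text_len > 0
                and child.link_len / child.text_len > MAX_LINK_RATIO):
            continue
        prune(child)
        kept.append(child)
    node.children = kept


def collect_text(node):
    if node.tag == "#text":
        return [node.text]
    parts = []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topic_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    filter_structure(builder.root)
    annotate(builder.root)
    prune(builder.root)
    return " ".join(collect_text(builder.root))
```

Calling `extract_topic_text` on a page with a navigation block of links and one article paragraph should keep the paragraph text and drop the navigation block. The link ratio is only a rough proxy; a measure closer to the paper's "correlativity" keyword would need the details of the full text.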










