

The Strategy of the Information Extraction for Text from News Web Pages
Abstract: This paper introduces a method for extracting the main text from news web pages that combines the STU-DOM tree with character-count statistics. The method assigns each node two semantic contextual attributes, filters out the nodes that are unrelated to the page's topic, and then uses the number of Chinese characters contained in the relevant tags to select the nodes that hold the main text. The strategy not only extracts the body content accurately from HTML documents but also retains, without loss, the topic-related links embedded in that text.
Authors: Chen Leilei (陈蕾蕾), Zhang Rujing (张如静), Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2008, Issue S2, pp. 1-2 (2 pages)
Keywords: statistics; STU-DOM tree; information extraction
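
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the STU-DOM tree is built or what the two semantic attributes are, so this example assumes a simple stand-in, namely that a container whose Chinese characters sit mostly inside links is topic-irrelevant. The names `CharCountParser`, `extract_main_text`, and `max_link_ratio` are hypothetical and chosen only for illustration.

```python
import re
from html.parser import HTMLParser

CJK = re.compile(r"[\u4e00-\u9fff]")          # common CJK ideographs
CONTAINERS = {"body", "div", "td", "article", "section"}
SKIPPED = {"script", "style"}


class Node:
    """One container element with its own (non-inherited) text statistics."""

    def __init__(self, tag):
        self.tag = tag
        self.text_parts = []
        self.chinese = 0          # Chinese characters attributed to this node
        self.linked_chinese = 0   # subset of those that sit inside <a> tags


class CharCountParser(HTMLParser):
    """Attributes each text run to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.stack = []
        self.skip_depth = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CONTAINERS:
            node = Node(tag)
            self.nodes.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif self.stack and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        node = self.stack[-1]
        count = len(CJK.findall(data))
        node.chinese += count
        if self.link_depth:
            node.linked_chinese += count
        if data.strip():
            node.text_parts.append(data.strip())


def extract_main_text(html, max_link_ratio=0.5):
    """Return the text of the container with the most Chinese characters,
    after dropping containers dominated by link text (an assumed filter)."""
    parser = CharCountParser()
    parser.feed(html)
    parser.close()
    candidates = [
        n for n in parser.nodes
        if n.chinese > 0 and n.linked_chinese / n.chinese <= max_link_ratio
    ]
    if not candidates:
        return ""
    best = max(candidates, key=lambda n: n.chinese)
    return "".join(best.text_parts)
```

Because link text inside the chosen container is counted and kept, a sentence such as "详情参见相关报道" survives intact even when "相关报道" is a hyperlink, while a navigation bar made up entirely of links is filtered out. The 0.5 threshold is an arbitrary illustrative value, not one taken from the paper.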








