Abstract
This paper introduces a method for extracting the main text of news web pages that combines the STU-DOM tree with text character-count statistics. The method assigns two semantic contextual attributes to each node, filters out the nodes that are unrelated to the page's topic, and then selects the nodes that contain the main text according to the number of Chinese characters held by the relevant tags in the STU-DOM tree. This strategy not only extracts the main text from HTML documents accurately, but also preserves, without loss, the topic-related links that appear within that text.
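The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the STU-DOM tree is built or what the two semantic attributes are, so the code below does not reproduce the authors' method. It assumes a simplified stand-in: each container element is credited with the Chinese characters in the text nearest to it, containers whose Chinese text is mostly link text are treated as topic-irrelevant, and the remaining container with the most Chinese characters is taken as the main text. All names here (`CONTAINERS`, `count_chinese`, `TextCounter`, `extract_main_text`, `max_link_ratio`) are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags act as candidate "block" nodes; the paper's
# STU-DOM node definition is not given in the abstract.
CONTAINERS = {"body", "div", "td", "article", "section"}


def count_chinese(text):
    """Count characters in the basic CJK Unified Ideographs block."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.chinese = 0        # Chinese characters credited to this node
        self.link_chinese = 0   # the subset of those found inside <a> tags
        self.text_parts = []    # text fragments, links included


class TextCounter(HTMLParser):
    """Credits each text fragment to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open container nodes
        self.nodes = []      # every container seen, in document order
        self.link_depth = 0
        self.skip_depth = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CONTAINERS:
            node = Node(tag)
            self.stack.append(node)
            self.nodes.append(node)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CONTAINERS:
            # Close the nearest open container with this tag name.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        node = self.stack[-1]
        n = count_chinese(data)
        node.chinese += n
        if self.link_depth:
            node.link_chinese += n
        if data.strip():
            node.text_parts.append(data.strip())


def extract_main_text(html, max_link_ratio=0.5):
    """Return the text of the container with the most Chinese characters,
    ignoring containers dominated by link text (assumed filter)."""
    parser = TextCounter()
    parser.feed(html)
    parser.close()
    candidates = [
        n for n in parser.nodes
        if n.chinese > 0 and n.link_chinese / n.chinese <= max_link_ratio
    ]
    if not candidates:
        return ""
    best = max(candidates, key=lambda n: n.chinese)
    return "".join(best.text_parts)
```

Because link text is counted but not removed, a link embedded in the chosen node stays in the output, which loosely mirrors the abstract's point about retaining topic-related links. The `max_link_ratio` threshold of 0.5 is an arbitrary illustrative value, not one taken from the paper.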
Authors
陈蕾蕾
张如静
Chen Leilei¹, Zhang Rujing² (Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China)
Source
《电脑知识与技术》
2008, Issue S2, pp. 1-2 (2 pages)
Computer Knowledge and Technology
Keywords
statistics
STU-DOM tree
information extraction