Abstract
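The main information conveyed by a Web page is usually buried in a large amount of irrelevant structure and text, which keeps users from quickly obtaining the topical information and limits the usability of the Web. Information extraction helps to solve this problem. Based on the DOM specification, and addressing the semi-structured nature of HTML and its lack of semantic description, an STU DOM tree model carrying semantic information is proposed. An HTML document is converted into an STU DOM tree, which is then filtered by structure and pruned by semantics, so that the topical information can be extracted accurately. The method does not depend on the information source and does not alter the structure or content of the source page. It is automatic, reliable, and general, and has considerable practical value.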
The Web is a vast resource of information, but its presentation limits its usability: the main information in a web page is usually hidden among unimportant features such as unnecessary images and extraneous links, which makes it difficult for users to acquire the topical information. Information extraction can help users locate the information of interest. A new DOM-based extraction method is proposed that transforms DOM trees into STU DOM trees and then processes them. An STU DOM tree can be viewed as a DOM tree with semantic contextual attributes. The key algorithm filters and prunes the STU DOM tree, and can automatically and accurately extract the useful and relevant content from HTML documents. The approach is general: it is independent of document structure and domain. Unlike most approaches, it also preserves the structure and content of the source page. Hence the approach is significant and reliable. It can be widely applied to web browsing on handheld devices, such as PDAs and mobile phones, and to retrieval systems.
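The filter-then-prune idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the STU DOM attributes or the pruning rules, so the two per-node attributes (`text_len`, `link_len`), the `NOISE_TAGS` set, the link-ratio threshold, and the function `extract_topic` are all assumptions made for illustration. The sketch builds a tree from HTML, annotates each node, drops structurally irrelevant tags, and prunes containers whose text is mostly link text.

```python
from html.parser import HTMLParser

# Assumption: tags treated as structural noise (not taken from the paper).
NOISE_TAGS = {"script", "style", "iframe", "form"}
# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Nodes exempt from link-ratio pruning, so inline links inside kept text survive.
EXEMPT_TAGS = {"#text", "a"}


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text      # non-empty only for "#text" nodes
        self.children = []
        self.text_len = 0     # hypothetical attribute: text length in subtree
        self.link_len = 0     # hypothetical attribute: text length inside <a>


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def annotate(node, in_link=False):
    """Compute text_len and link_len bottom-up for every node."""
    in_link = in_link or node.tag == "a"
    node.text_len = len(node.text)
    node.link_len = len(node.text) if in_link else 0
    for child in node.children:
        annotate(child, in_link)
        node.text_len += child.text_len
        node.link_len += child.link_len


def filter_and_prune(node, max_link_ratio):
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue  # structure-based filter
        if child.text_len == 0:
            continue  # subtree carries no text
        ratio = child.link_len / child.text_len
        if child.tag not in EXEMPT_TAGS and ratio > max_link_ratio:
            continue  # prune link-dominated containers such as nav bars
        filter_and_prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_text(node):
    if node.tag == "#text":
        return node.text
    parts = (extract_text(child) for child in node.children)
    return " ".join(p for p in parts if p)


def extract_topic(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    filter_and_prune(builder.root, max_link_ratio)
    return extract_text(builder.root)
```

Calling `extract_topic(page_html)` on a page with a navigation bar and an article body should return the article text while dropping the navigation links and script content. The threshold of 0.5 is an arbitrary starting point and would need tuning on real pages.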
Source
《计算机研究与发展》
EI
CSCD
Peking University Core Journal
2004, No. 10, pp. 1786-1792 (7 pages)
Journal of Computer Research and Development
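Funding
National "973" Key Basic Research Development Program (G1999032705)
National "863" High-Tech Research and Development Program, Major Special Project on Database Management Systems and Their Applications (2002AA4Z3440)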