Abstract
To address the complex implementation, difficult maintenance, and slow extraction speed of current Web information extraction techniques, this paper proposes a new Web extraction strategy based on the characteristics of Web pages. The strategy reduces the complexity of handling page structure and thereby speeds up Web information extraction. A model for automatic Web information extraction is built on this strategy: it first analyzes the structure of a page, quickly generates extraction rules from that structure, and builds a rule library; it then analyzes the extracted page content and builds a resource library. A method based on this model can learn autonomously and extract information automatically, which greatly reduces manual involvement while achieving relatively good extraction results.
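The abstract does not give the rule format or the learning procedure, so the following is only a minimal sketch of the general idea (analyze page structure, derive an extraction rule, store it in a rule library, collect extracted content in a resource library). It assumes that a rule is simply the tag path leading to a field, learned from one labeled sample page. The names `PathTextParser`, `learn_rules`, `extract`, `rule_library`, and `resource_library` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect (tag path, text) pairs, e.g. ("html/body/h1", "Title")."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def path_texts(html):
    """Analyze page structure: return every text node with its tag path."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_rules(sample_html, labeled_fields):
    """Assumed rule generation: one tag-path rule per labeled field."""
    rules = {}
    for path, text in path_texts(sample_html):
        for field, value in labeled_fields.items():
            if text == value and field not in rules:
                rules[field] = path
    return rules


def extract(html, rules):
    """Apply stored rules: take the first text found at each rule's path."""
    record = {}
    for path, text in path_texts(html):
        for field, rule_path in rules.items():
            if path == rule_path and field not in record:
                record[field] = text
    return record


# Hypothetical rule library (keyed by page template) and resource library.
rule_library = {}
resource_library = []

sample = "<html><body><h1>Old title</h1><div><p>Old body</p></div></body></html>"
rule_library["news"] = learn_rules(sample, {"title": "Old title", "body": "Old body"})

new_page = "<html><body><h1>New title</h1><div><p>New body</p></div></body></html>"
resource_library.append(extract(new_page, rule_library["news"]))
```

In this sketch a rule learned once from the sample page is reused on any page that shares the same layout, which is one plausible reading of how a rule library could reduce manual work. A real system would need rules that tolerate layout variation, which plain tag paths do not.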
Source
《计算机与现代化》
2009, No. 1, pp. 38-40, 48 (4 pages)
Computer and Modernization
Keywords
Web information extraction
Web extraction strategy
autonomous learning
extraction rule