Automatic Web Information Extraction Based on GATE Semantic Annotation
Cited by: 4
Abstract This paper studies how to implement automatic Web information extraction from semantically annotated sample pages. Using GATE, a framework for developing and deploying natural language processing components, a domain ontology is first applied to annotate the content of sample web pages semantically. This precisely locates the semantic items to be extracted, and the sample pages are then parsed into S-DOM trees on that basis. Feature descriptions of the semantic items are extracted from the S-DOM trees to form training instances, from which a machine learning algorithm induces extraction rules and automatically generates a wrapper. During extraction, the system detects changes in web pages by comparing the similarity of their structure, and it actively learns new rules to extend the rule base. Experimental results show that, because precise localization safeguards the quality of the training samples, a wrapper learned from a small number of sample pages can achieve satisfactory recall and precision.
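The abstract does not say how page structure similarity is computed. As a rough illustration only, the sketch below assumes a Jaccard similarity over root-to-element tag paths and a re-learning threshold of 0.6; the names `tag_paths`, `structure_similarity`, and `needs_relearning`, the measure, and the threshold are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths, e.g. 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets, in [0, 1]."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def needs_relearning(sample_html, new_html, threshold=0.6):
    """Assumed policy: trigger rule learning when similarity drops below threshold."""
    return structure_similarity(sample_html, new_html) < threshold
```

Under these assumptions, a page whose layout matches a known sample would be handled by the existing wrapper, while a page scoring below the threshold would be passed to the rule learner as a candidate new sample.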
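Authors Nie Hui (聂卉), Huang Guipeng (黄贵鹏)
Source Library and Information Service (《图书情报工作》), CSSCI, Peking University Core Journal, 2010, No. 5, pp. 110-114 (5 pages)
Funding One of the research outputs of the Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Grant No. 08JC870013)
Keywords Web information extraction; semantic annotation; wrapper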