Abstract
This paper studies how to implement automatic Web information extraction from semantically annotated sample pages. Using GATE, an infrastructure for developing and deploying software components that process natural language, the system first applies a domain ontology to annotate the content of the sample pages semantically, which precisely locates the semantic items to be extracted, and then parses each sample page into an S-DOM tree. Feature descriptions of the semantic items are extracted from the S-DOM trees to form training instances, from which a machine learning algorithm induces extraction rules and a wrapper is generated automatically. During extraction, the system is self-adaptive: it detects changes in Web pages by comparing the similarity of page structures, then actively learns new rules and extends and updates the rule set. Experimental results show that, because precise semantic labeling ensures the quality of the training samples, a wrapper learned from a small number of sample pages can achieve satisfactory recall and precision.
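The abstract does not say how page-structure similarity is measured. As a rough illustration only, the sketch below assumes one simple choice: collect the set of root-to-element tag paths in each page and compare the two sets with the Jaccard index. The names `tag_paths`, `structure_similarity`, and `needs_relearning`, and the threshold of 0.8, are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths, e.g. 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard index of the two pages' tag-path sets (assumed measure)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def needs_relearning(sample_html, new_html, threshold=0.8):
    """Hypothetical trigger: relearn when the new page's structure drifts."""
    return structure_similarity(sample_html, new_html) < threshold
```

Under these assumptions, a page whose text differs but whose layout matches a sample page scores 1.0 and is handled by the existing wrapper, while a page with a different layout falls below the threshold and would be passed back to the rule learner.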
Source
《图书情报工作》
CSSCI
PKU Core Journal
2010, No. 5, pp. 110-114 (5 pages)
Library and Information Service
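Funding
A research outcome of the Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (grant no. 08JC870013).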
Keywords
Web information extraction
semantic annotation
wrapper