Abstract
This paper proposes a new Web information extraction method that uses the breadth of sub-trees to extract page information automatically from different scientific and technical literature Web sites, without treating each site differently. Extraction experiments were carried out on a large number of pages from such sites, and the method has been applied to the Gansu Province Science and Technology Literature Sharing Platform. The experimental results show that the method extracts the relevant information automatically, regardless of which Web site a page comes from, while maintaining high recall and precision.
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2009, No. 3, pp. 89-90, 93 (3 pages in total)
Funding
Supported by the Gansu Province Technology Research and Development Special Program Fund (2007GS05285)
Keywords
sub-tree breadth
information extraction
cross-database search