Abstract
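After an analysis of various Web information extraction approaches, a new extraction method is applied to electronic journal information extraction. The method first locates the information blocks to be extracted precisely, by comparing similarity using relative paths in the document structure combined with node content features. Then, for each information item to be extracted, regular expressions are used to construct feature patterns of the text items; on this basis, accurate extraction is achieved. Test results show that this Web-based method for extracting electronic journal metadata achieves higher recall and precision than general information extraction methods, with fairly satisfactory results.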
A novel method for extracting periodical metadata was proposed after various ways of extracting information from the Web were analyzed. Before the metadata were extracted, the target information blocks were identified by using relative paths in the document and the contents of nodes to judge similarity. According to the similarity, the target information blocks were located. Regular expressions were used to characterize the text of the extracted information. The experimental results showed the method ob...
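The two stages described in the abstract (block location by path and content similarity, then regular-expression extraction of individual items) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal weighting of the two scores, the cue words, the sample candidates, and the field patterns are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def path_similarity(path_a, path_b):
    """Similarity of two relative tag paths such as 'div/table/tr/td' (0.0 to 1.0)."""
    return SequenceMatcher(None, path_a.split("/"), path_b.split("/")).ratio()


def content_score(text, cue_words):
    """Fraction of the cue words that occur in the node text."""
    if not cue_words:
        return 0.0
    return sum(1 for word in cue_words if word in text) / len(cue_words)


def locate_block(candidates, template_path, cue_words, weight=0.5):
    """Return the (path, text) candidate with the highest combined score.

    The linear combination and the default weight are assumptions.
    """
    def score(candidate):
        path, text = candidate
        return (weight * path_similarity(path, template_path)
                + (1 - weight) * content_score(text, cue_words))
    return max(candidates, key=score)


# Hypothetical feature patterns for three metadata items.
FIELD_PATTERNS = {
    "year": re.compile(r"\b(\d{4})\b"),
    "issue": re.compile(r"Issue\s+(\d+)"),
    "pages": re.compile(r"pp\.\s*(\d+-\d+)"),
}


def extract_fields(text):
    """Apply each pattern to the located block and collect the first match."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


# Made-up candidate nodes, each given as (relative path, text content).
candidates = [
    ("div/ul/li", "Home | Login | Help"),
    ("div/table/tr/td", "Title: Example article. 2007, Issue 12, pp. 13-15"),
    ("div/p", "Copyright notice"),
]
block = locate_block(candidates, "div/table/tr/td", ["Issue", "pp."])
fields = extract_fields(block[1])
```

Here `block` is the candidate whose path and content best match the template, and `fields` holds whatever items the patterns found in it. A real wrapper would derive the candidates from a parsed DOM rather than a hand-written list.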
Source
《华中科技大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (北大核心)
2007, Issue 12, pp. 13-15 (3 pages)
Journal of Huazhong University of Science and Technology(Natural Science Edition)
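Funding
China Next Generation Internet project (CNGI-04-15-7A)
Hubei Province Science and Technology Basic Conditions Platform special fund
Wuhan Science and Technology Key Research project (20061002032)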
Keywords
information extraction
wrapper
pattern matching
electronic journals
periodical metadata