Abstract
This paper presents a method for automatically extracting metadata from documents such as scientific literature. The method combines automatic induction with a feature-similarity algorithm: an inductive learning algorithm based on feature similarity generates the extraction rules automatically, and these rules are then used to extract metadata from documents. The method uses certain properties of the documents themselves to divide their content into blocks, generates extraction rules by induction, and matches the generated rules using feature similarity before extracting the metadata. This improves both the efficiency of automatic rule generation and the precision of metadata extraction.
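The pipeline described in the abstract (block segmentation, rule induction, similarity-based matching) can be illustrated with a minimal Python sketch. It is not the authors' algorithm, since the abstract gives no implementation details. Everything here is an assumption made for illustration: splitting blocks at blank lines, the four features, the use of per-label mean feature vectors as stand-ins for induced rules, cosine similarity as the similarity measure, the 0.8 threshold, and all function names.

```python
import math
from collections import defaultdict


def split_blocks(text):
    """Divide a document into blocks at blank lines (assumed segmentation)."""
    return [b.strip() for b in text.split("\n\n") if b.strip()]


def features(block, index, total):
    """Map a block to a small numeric feature vector (hypothetical features)."""
    n_chars = max(len(block), 1)
    return [
        1.0 - index / max(total - 1, 1),             # closeness to top of document
        min(len(block.split()), 50) / 50.0,          # capped word count
        sum(c.isupper() for c in block) / n_chars,   # uppercase ratio
        1.0 if "@" in block else 0.0,                # e-mail marker
    ]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def induce_rules(labeled_docs):
    """Generalise labeled blocks into one prototype vector per metadata label.

    labeled_docs is a list of (text, [label for each block]) pairs.
    Averaging is a simple stand-in for the paper's induction step.
    """
    grouped = defaultdict(list)
    for text, labels in labeled_docs:
        blocks = split_blocks(text)
        for i, (block, label) in enumerate(zip(blocks, labels)):
            grouped[label].append(features(block, i, len(blocks)))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def extract(text, rules, threshold=0.8):
    """Assign each block to the most similar rule if it clears the threshold."""
    blocks = split_blocks(text)
    result = {}
    for i, block in enumerate(blocks):
        vec = features(block, i, len(blocks))
        label, score = max(
            ((lab, cosine(vec, proto)) for lab, proto in rules.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold and label not in result:
            result[label] = block
    return result


# Usage with made-up training and test documents.
train_text = (
    "A Study of Rule Induction\n\n"
    "alice@example.org\n\n"
    "This paper describes a method for learning extraction rules "
    "from examples of labeled documents."
)
rules = induce_rules([(train_text, ["title", "email", "abstract"])])

new_text = (
    "Metadata Extraction From Papers\n\n"
    "bob@example.com\n\n"
    "We present an approach that divides a document into blocks "
    "and matches each block against learned rules."
)
metadata = extract(new_text, rules)
```

In this sketch the threshold trades coverage for precision: raising it leaves more blocks unlabeled, while lowering it accepts weaker matches.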
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2011, No. 12, pp. 148-150 (3 pages)
Keywords
Metadata
Induction learning
Machine learning
Information extraction