Abstract
Information extracted from semi-structured text is usually coarse-grained, which prevents effective semantic analysis of the extraction results. To address this problem, a domain-oriented secondary extraction method for structured information, based on pattern matching, is proposed. The method targets semi-structured text presented as Web documents. It first performs domain recognition on the coarse-grained extraction results and loads the corresponding domain lexicon. It then maps each pattern role to a word in the word segmentation sequence according to the part of speech of that role, thereby extracting structured information from the sequence and providing support for accurate semantic analysis. Experiments show that the method yields more accurate extraction results.
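The pipeline described in the abstract (domain recognition, lexicon loading, then mapping pattern roles to words by part of speech) can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicons, the patterns, the tag names, the overlap-count domain recognizer, and the whitespace tokenizer standing in for word segmentation are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical domain lexicons: word -> part-of-speech tag (assumed tag set).
LEXICONS: Dict[str, Dict[str, str]] = {
    "real_estate": {"apartment": "n", "downtown": "loc", "month": "q"},
    "jobs": {"engineer": "n", "hire": "v", "salary": "n"},
}

# Hypothetical patterns: an ordered list of (role, required part of speech).
PATTERNS: Dict[str, List[Tuple[str, str]]] = {
    "real_estate": [("object", "n"), ("location", "loc"), ("price", "num")],
    "jobs": [("action", "v"), ("position", "n")],
}


def recognize_domain(text: str) -> str:
    """Pick the domain whose lexicon shares the most words with the text."""
    words = text.lower().split()
    return max(LEXICONS, key=lambda d: sum(w in LEXICONS[d] for w in words))


def tag(text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stand-in for word segmentation plus POS tagging.

    Splits on whitespace, tags digit strings as "num", looks other words
    up in the loaded domain lexicon, and tags unknown words as "x".
    """
    tagged = []
    for word in text.lower().split():
        pos = "num" if word.isdigit() else lexicon.get(word, "x")
        tagged.append((word, pos))
    return tagged


def match_pattern(
    tagged: List[Tuple[str, str]], pattern: List[Tuple[str, str]]
) -> Optional[Dict[str, str]]:
    """Map each role, in order, to the next word with the required POS.

    Returns None when some role cannot be filled.
    """
    result: Dict[str, str] = {}
    i = 0
    for role, pos in pattern:
        while i < len(tagged) and tagged[i][1] != pos:
            i += 1
        if i == len(tagged):
            return None
        result[role] = tagged[i][0]
        i += 1
    return result


def extract(text: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Run the three steps on one coarse-grained extraction result."""
    domain = recognize_domain(text)
    tagged = tag(text, LEXICONS[domain])
    return domain, match_pattern(tagged, PATTERNS[domain])
```

For instance, calling `extract("apartment downtown 3000 per month")` selects the hypothetical `real_estate` lexicon and fills the `object`, `location`, and `price` roles. A real system for Chinese text would replace `tag` with a proper word segmenter and part-of-speech tagger.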
Source
Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》)
Indexed in: EI, CSCD, Peking University Core Journals
2014, No. 8, pp. 758-768 (11 pages)
Funding
Supported by the National Natural Science Foundation of China (Nos. 60975033, 60575035, 60275022)
Keywords
Semi-structured Text
Pattern Matching
Structured Information
Coarse-Grained Extraction Result
Domain Recognition