
Structured Information Extraction Based on Pattern Matching (cited by: 6)
Abstract: Information extracted from semi-structured text is typically coarse-grained, which prevents effective semantic analysis of the extraction results. To address this, a domain-oriented secondary extraction method for structured information, based on pattern matching, is proposed. The method targets semi-structured text presented as Web documents. It performs domain recognition on the coarse-grained extraction results and loads the corresponding domain lexicon according to the recognized domain. Each role in a pattern is then mapped to a word in the segmented word sequence according to the role's part of speech, so that structured information is extracted from the word sequence and can support accurate semantic analysis. Experiments show that the method yields more accurate extraction results.
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》; indexed in EI, CSCD, and the Peking University Core Journals list), 2014, No. 8, pp. 758-768 (11 pages)
Funding: Supported by the National Natural Science Foundation of China (Nos. 60975033, 60575035, 60275022)
Keywords: Semi-structured Text; Pattern Matching; Structured Information; Coarse-Grained Extraction Result; Domain Recognition
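
The pipeline summarized in the abstract (domain recognition, lexicon loading, then mapping pattern roles to words by part of speech) can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicons, part-of-speech tags, pattern format, and function names below are all hypothetical, the input is assumed to be already segmented into words, and English tokens stand in for a Chinese word sequence. Domain recognition is approximated here by simple lexicon coverage, and role mapping by a greedy left-to-right scan.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical domain lexicons: each word maps to a part-of-speech tag.
LEXICONS: Dict[str, Dict[str, str]] = {
    "housing": {"apartment": "n", "rents": "v", "downtown": "loc", "3000": "num"},
    "jobs": {"engineer": "n", "hires": "v", "salary": "n"},
}

# Hypothetical patterns: an ordered list of (role, required part of speech).
PATTERNS: Dict[str, List[Tuple[str, str]]] = {
    "housing": [("item", "n"), ("action", "v"), ("price", "num")],
    "jobs": [("employer", "n"), ("action", "v"), ("position", "n")],
}


def recognize_domain(words: List[str]) -> str:
    """Assumed heuristic: pick the domain whose lexicon covers the most words."""
    return max(LEXICONS, key=lambda d: sum(w in LEXICONS[d] for w in words))


def tag(words: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag each word with its lexicon part of speech; unknown words get 'x'."""
    return [(w, lexicon.get(w, "x")) for w in words]


def match_pattern(
    tagged: List[Tuple[str, str]], pattern: List[Tuple[str, str]]
) -> Optional[Dict[str, str]]:
    """Map each role, in order, to the next word with the required tag.

    Returns None if any role cannot be filled.
    """
    result: Dict[str, str] = {}
    i = 0
    for role, required_pos in pattern:
        # Skip words whose part of speech does not fit this role.
        while i < len(tagged) and tagged[i][1] != required_pos:
            i += 1
        if i == len(tagged):
            return None
        result[role] = tagged[i][0]
        i += 1
    return result


def extract(words: List[str]) -> Optional[Dict[str, str]]:
    """Run domain recognition, lexicon tagging, and pattern matching."""
    domain = recognize_domain(words)
    tagged = tag(words, LEXICONS[domain])
    return match_pattern(tagged, PATTERNS[domain])


if __name__ == "__main__":
    print(extract(["downtown", "apartment", "rents", "for", "3000"]))
```

With the toy data above, the word sequence is recognized as belonging to the hypothetical `housing` domain, and the roles `item`, `action`, and `price` are filled from the words tagged `n`, `v`, and `num`. A sequence that lacks a word for some role yields `None` rather than a partial record.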
