
Web Information Extraction Based on Probabilistic Model (cited by 4)
Abstract: To exploit the two-dimensional structure and content characteristics of web pages, a tree-structured hierarchical conditional random fields (TH-CRFs) model is proposed for web object extraction. First, an improved multi-feature vector space model represents each web page in terms of both its structure and its content. Second, a Boolean model and multi-rule attributes are introduced to better represent the structural and semantic features of web objects. Third, TH-CRFs are used to extract web object information, locating the relevant recruitment information while making model training more efficient. Finally, the proposed model is compared experimentally with existing web information extraction models. The results show that TH-CRFs improve extraction accuracy and reduce the time complexity of extraction.
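Authors: Wang Jing (王静), Liu Zhijing (刘志镜)
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), 2010, No. 6, pp. 847-855 (9 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
Funding: National Key Technology R&D Program of China (No. 2007BAH08B02)
Keywords: Web Object; Conditional Random Fields (CRFs); Information Extraction (IE)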
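
The abstract above does not spell out the TH-CRFs formulation, so the following is only a generic sketch of why a tree-structured model allows efficient labeling: exact MAP inference by bottom-up max-product dynamic programming over a page tree. The toy tree, the two labels (0 = "other", 1 = "record"), and all scores are hypothetical and are not taken from the paper. A brute-force search is included purely as a sanity check.

```python
import itertools

def map_labels(children, node_score, edge_score, num_labels, root=0):
    """Exact MAP labeling of a tree by max-product (here: max-sum) DP.

    children:   dict parent -> list of child nodes (hypothetical DOM-like tree)
    node_score: dict node -> list of per-label scores (log-potentials)
    edge_score: edge_score[parent_label][child_label] (log-potentials)
    """
    best = {}  # best[n][y]: best score of the subtree rooted at n when n has label y
    back = {}  # back[(n, y)]: chosen label for each child of n, given n has label y

    def upward(n):
        for c in children.get(n, []):
            upward(c)
        best[n] = []
        for y in range(num_labels):
            total = node_score[n][y]
            choice = {}
            for c in children.get(n, []):
                yc = max(range(num_labels),
                         key=lambda l: edge_score[y][l] + best[c][l])
                total += edge_score[y][yc] + best[c][yc]
                choice[c] = yc
            best[n].append(total)
            back[(n, y)] = choice

    upward(root)
    root_label = max(range(num_labels), key=lambda y: best[root][y])

    labels = {}

    def downward(n, y):
        labels[n] = y
        for c, yc in back[(n, y)].items():
            downward(c, yc)

    downward(root, root_label)
    return labels, best[root][root_label]

def brute_force_map(children, node_score, edge_score, num_labels):
    """Enumerate every labeling; exponential, only for checking small trees."""
    nodes = sorted(node_score)
    best_assign, best_score = None, float("-inf")
    for combo in itertools.product(range(num_labels), repeat=len(nodes)):
        assign = dict(zip(nodes, combo))
        score = sum(node_score[n][assign[n]] for n in nodes)
        score += sum(edge_score[assign[p]][assign[c]]
                     for p, cs in children.items() for c in cs)
        if score > best_score:
            best_assign, best_score = assign, score
    return best_assign, best_score

# Hypothetical toy page tree: node 0 is the root, 1 and 2 its children, 3 a child of 1.
children = {0: [1, 2], 1: [3]}
node_score = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [0.5, 0.0], 3: [0.0, 1.0]}
edge_score = [[0.5, 0.0], [0.0, 0.5]]  # small reward when parent and child agree

labels, score = map_labels(children, node_score, edge_score, num_labels=2)
print(labels, score)
```

The dynamic program visits each edge once per label pair, so its cost grows linearly with the number of nodes, whereas the brute-force check grows exponentially. Whether the paper's reported reduction in time complexity comes from this property is not stated in the abstract.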

