
A Multi-record Webpage Attribute Extraction Method Combining Active Learning
(结合主动学习的多记录网页属性抽取方法)

Cited by: 1
Abstract: Attribute extraction can be divided into two phases, alignment and semantic annotation. In existing alignment methods, some attributes that share the same tag but carry different semantics are wrongly assigned to the same group. In addition, achieving high annotation accuracy usually requires a large, manually labeled training set. To address both problems, this paper presents a multi-record webpage attribute extraction method combining active learning. For the misalignment problem, the shallow semantics of attributes are incorporated into alignment to reduce the influence of identical tags with inconsistent semantics. In the semantic annotation phase, textual, visual and global features of the webpage are extracted, and an active-learning-based SVM classifier is applied to obtain semantically labeled structured data. For the active learning strategy, global sample information is introduced to build an uncertainty-based selection strategy, so that samples whose semantic class is predicted with low confidence are selected for labeling. Experiments on several datasets, including forum (BBS) and microblog data, show that the proposed method extracts attributes more effectively than existing methods.
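The selection strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so everything here is an assumption: uncertainty is measured by the entropy of the classifier's class probabilities, "global sample information" is approximated by each sample's mean cosine similarity to the rest of the unlabeled pool, and the names `select_queries` and `beta` are hypothetical. The class probabilities would come from a trained classifier such as an SVM; here they are supplied directly.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_queries(features, probs, k=1, beta=1.0):
    """Return indices of the k unlabeled samples to send for manual labeling.

    Assumed scoring rule (not taken from the paper):
        score_i = entropy(probs_i) * density_i ** beta
    where density_i is the mean cosine similarity of sample i to the
    other samples in the pool, clamped at zero.
    """
    n = len(features)
    scores = []
    for i in range(n):
        others = [cosine(features[i], features[j]) for j in range(n) if j != i]
        density = max(sum(others) / len(others), 0.0) if others else 0.0
        scores.append(entropy(probs[i]) * density ** beta)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy pool: samples 0 and 2 are equally uncertain, but sample 0 lies in a
# denser region of the pool, so it is ranked first.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
probs = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
queries = select_queries(features, probs, k=2)
```

Weighting uncertainty by density is one common way to keep an active learner from spending labels on isolated outliers; the paper's own strategy may combine the two signals differently.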
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), indexed in EI, CSCD and the Peking University Core Journals list; 2016, No. 8, pp. 673-681 (9 pages)
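Funding: Supported by the National Natural Science Foundation of China, Young Scientists Fund (No. 61300105); the Joint Project of the Doctoral Program Foundation of the Ministry of Education (No. 2012351410010); the Fujian Provincial Major Science and Technology Project (No. 2013H6012); and the Fuzhou Science and Technology Program (No. 2013-PT-45, 2012-G-113)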
Keywords: Attribute Extraction; Semantic Classification; Active Learning
