Picture-Text Webpage Model and Page Element Feature Induction (页面图文模型与元素特征归纳)

Abstract: For information extraction from web pages whose core content is pictures and text, this paper formally proposes a theoretical model for analyzing page elements. By defining a basic element set and transformation rules, the picture-text page model simplifies the page's DOM tree structure and exposes the picture and text features of the elements within the page. On this basis, an element classification similarity is defined and used to select among the element features of the picture-text page model and to induce the best classification features. An algorithm for obtaining the best classification feature set and the recognition threshold is proposed and implemented. Experimental results show that the picture-text page model reduces the number of page elements, and that the feature set induction algorithm achieves the desired classification accuracy at a small learning cost.

Authors: Yu Long (于龙), Wang Jinlong (王金龙)

Affiliation: PLA University of Science and Technology (解放军理工大学)

Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, PKU Core, 2013, No. 4, pp. 136-143 (8 pages)

Funding: National 863 Program of China (2010AA012404)

Keywords: web extraction; web page element; picture-text model; feature induction

Classification: TP393.09 [Automation and Computer Technology: Computer Application Technology]
