
Shopping information extraction method based on rapid construction of template (基于快速构建模板的购物信息抽取方法) Cited by: 3
Abstract: Shopping pages generated from templates carry large amounts of information and have complex structure. This paper proposes a method that extracts shopping information from such template-generated pages without relying on complex learned rules. The work defines the Web page template and the information extraction template for a page, designs a template language for building templates quickly, and proposes a content extraction model based on that language. Experiments show that, on a standard test set of 450 Web pages, the recall of the proposed method is 12% higher than that of the Extraction problem Algorithm (EXALG). On a test set of 250 Web pages, recall is 7.4% higher than the Visual information and Tag structure based wrapper generator (ViNTs) method and 0.2% higher than the Augmenting automatic information extraction with visual perceptions (ViPER) method, and accuracy is 5.2% higher than ViNTs and 0.2% higher than ViPER. These gains in recall and accuracy improve the Web page analysis used in shopping information retrieval and price comparison systems.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2014, No. 3, pp. 733-737, 753 (6 pages)
Funding: National Natural Science Foundation of China (61262088)
Keywords: template; electronic commerce; information extraction; shopping information; goods
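
As a rough illustration of the template idea summarized in the abstract, the sketch below fills named slots in a hand-written template and then scores the result. The abstract does not reproduce the paper's template language, so the `{{field}}` slot syntax, the function names `compile_template`, `extract`, and `precision_recall`, and the sample page are all assumptions made for illustration, not the authors' design. It also assumes that the reported accuracy corresponds to precision in the usual information extraction sense.

```python
import re


def compile_template(template: str) -> re.Pattern:
    """Turn a template with {{field}} slots into a regex with named groups.

    Hypothetical slot syntax: literal markup is matched exactly,
    and each {{field}} captures the shortest text up to the next literal.
    """
    parts = re.split(r"\{\{(\w+)\}\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal markup from the template
        else:
            pattern += f"(?P<{part}>.*?)"     # named slot, non-greedy
    return re.compile(pattern, re.DOTALL)


def extract(template: str, page: str) -> list[dict[str, str]]:
    """Return one record per place where the template matches the page."""
    return [m.groupdict() for m in compile_template(template).finditer(page)]


def precision_recall(extracted, expected):
    """Compare extracted records with hand-labelled ones as sets of records."""
    extracted_set = {tuple(sorted(r.items())) for r in extracted}
    expected_set = {tuple(sorted(r.items())) for r in expected}
    correct = len(extracted_set & expected_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    return precision, recall


# Made-up sample page and template, for illustration only.
PAGE = (
    '<ul>'
    '<li class="item"><span class="name">Kettle</span>'
    '<span class="price">19.90</span></li>'
    '<li class="item"><span class="name">Toaster</span>'
    '<span class="price">34.50</span></li>'
    '</ul>'
)
TEMPLATE = (
    '<li class="item"><span class="name">{{name}}</span>'
    '<span class="price">{{price}}</span></li>'
)

records = extract(TEMPLATE, PAGE)
# Hand-labelled records; the third one is not present in PAGE, so recall drops.
gold = records + [{"name": "Blender", "price": "49.00"}]
precision, recall = precision_recall(records, gold)
```

Matching literal markup exactly keeps the sketch short, but it breaks as soon as a page varies in whitespace or attribute order, which is one reason a dedicated template language and extraction model, as the paper proposes, would be preferable in practice.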

