Abstract
Shopping-information Web pages generated from templates carry a large volume of information and have complex structure. To handle such pages, this paper proposes a method that extracts shopping information from template-generated Web pages without relying on complex learned rules. The work defines the Web page template and the information extraction template of a page, designs a template language for constructing templates rapidly, and presents an extraction model based on that template language. Experimental results show that, on a standard test set of 450 Web pages, the recall of the proposed method is 12% higher than that of the Extraction problem Algorithm (EXALG). On a test set of 250 Web pages, recall is 7.4% higher than the Visual information and Tag structure based wrapper generator (ViNTs) method and 0.2% higher than the Augmenting automatic information extraction with visual perceptions (ViPER) method, while accuracy is 5.2% higher than ViNTs and 0.2% higher than ViPER. These gains in recall and accuracy improve Web page analysis in shopping information retrieval and comparison-shopping systems.
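The abstract does not give the syntax of the paper's template language, so the following is only a minimal sketch of the general idea of template-driven extraction, not the authors' method. It assumes a hypothetical template notation in which literal HTML is interleaved with `{field}` placeholders, and it scores the result with recall and precision computed over whole records. The names `compile_template`, `extract`, and `precision_recall`, and the sample page, are all illustrative assumptions.

```python
import re

# Hypothetical template notation (an assumption, not the paper's language):
# literal HTML with {field} placeholders marking the values to extract.
TEMPLATE = ('<li><span class="name">{name}</span>'
            '<span class="price">{price}</span></li>')

# Made-up sample page, used only to exercise the sketch.
PAGE = ('<ul>'
        '<li><span class="name">Kettle</span><span class="price">19.90</span></li>'
        '<li><span class="name">Toaster</span><span class="price">24.50</span></li>'
        '</ul>')


def compile_template(template):
    """Turn a {field} template into a regex with one named group per field."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)          # literal markup, matched exactly
        else:
            pattern += "(?P<%s>.*?)" % part     # field value, matched lazily
    return re.compile(pattern, re.DOTALL)


def extract(template, html):
    """Return one dict per region of the page that matches the template."""
    return [m.groupdict() for m in compile_template(template).finditer(html)]


def precision_recall(extracted, expected):
    """Score whole records: a record counts only if every field matches."""
    correct = sum(1 for record in extracted if record in expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    records = extract(TEMPLATE, PAGE)
    print(records)
    print(precision_recall(records, records))
```

Matching literal markup exactly keeps the sketch short, but it breaks as soon as whitespace or attributes vary between pages; a practical template language would need to tolerate such variation.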
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2014, No. 3, pp. 733-737, 753 (6 pages)
Funding
National Natural Science Foundation of China (61262088)
Keywords
template
electronic commerce
information extraction
shopping information
goods