基于快速构建模板的购物信息抽取方法 (cited by: 3)

Shopping information extraction method based on rapid construction of template

Abstract: Shopping Web pages generated from templates carry a large amount of information and have a complex structure. For such pages, this paper proposes a method that extracts shopping information from template-generated pages without using complex learning rules. The work defines the Web page template and the information extraction template of a Web page, designs a template language for constructing templates rapidly, and proposes an extraction model based on that template language. Experimental results show that, on a standard test set of 450 Web pages, the recall rate of the proposed method is 12% higher than that of the Extraction problem Algorithm (EXALG). On a test set of 250 Web pages, the recall rate is 7.4% higher than that of the Visual information and Tag structure based wrapper generator (ViNTs) method and 0.2% higher than that of the Augmenting automatic information extraction with visual perceptions (ViPER) method, and the accuracy rate is 5.2% higher than that of ViNTs and 0.2% higher than that of ViPER. These gains in recall and accuracy improve the accuracy of Web page analysis and the recall of information in shopping information retrieval and shopping price-comparison systems.
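The abstract does not reproduce the paper's template language, so the following is only a minimal sketch of the general idea of template-driven extraction without learned rules. The `{{slot}}` placeholder syntax, the function names `compile_template` and `extract`, and the sample markup are all assumptions made for illustration; they are not the authors' template language or model.

```python
import re


def compile_template(template: str) -> re.Pattern:
    """Compile a template with hypothetical {{slot}} placeholders into a regex.

    Literal markup is matched exactly; each slot becomes a named,
    non-greedy capture group.
    """
    # re.split with a capture group alternates: literal, slot, literal, ...
    parts = re.split(r"\{\{(\w+)\}\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal markup from the template
        else:
            pattern += f"(?P<{part}>.*?)"    # slot to be extracted
    return re.compile(pattern, re.DOTALL)


def extract(template: str, page: str) -> list[dict[str, str]]:
    """Return one record per place where the template matches the page."""
    compiled = compile_template(template)
    return [match.groupdict() for match in compiled.finditer(page)]


# Hypothetical template and page, for illustration only.
TEMPLATE = '<span class="name">{{name}}</span><span class="price">{{price}}</span>'
PAGE = """
<ul>
<li><span class="name">USB cable</span><span class="price">9.90</span></li>
<li><span class="name">Keyboard</span><span class="price">129.00</span></li>
</ul>
"""

records = extract(TEMPLATE, PAGE)
```

Here the template is written by hand once per page layout and then applied to every page generated from that layout, which is the property the abstract attributes to rapid template construction. A real template language would also need to handle optional fields and layout variations, which this sketch ignores.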

Authors: 李萍 (LI Ping), 朱建波 (ZHU Jianbo), 周立新 (ZHOU Lixin), 廖彬 (LIAO Bin)

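Affiliations: School of Software and Microelectronics, Peking University; College of Information Science and Engineering, Xinjiang University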

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 3, pp. 733-737, 753 (6 pages)

Funding: National Natural Science Foundation of China (61262088)

Keywords: template; electronic commerce; information extraction; shopping information; goods

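Classification: TP391.3 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]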
