Research and Implementation of a Transaction Information Extraction Tool for C2C E-commerce Sites

Abstract: This paper studies the extraction of sales records and user information from Taobao and Baidu Youa, two representative C2C e-commerce platforms in China. For shop sales data on the two sites, it designs a Web data extraction algorithm based on JerichoHtmlParser that uses HTML data tags as landmarks. For user information on the two sites, it designs a Web data extraction algorithm based on regular expression matching. A Web extraction system is designed and implemented that can extract data from different sites according to different extraction rules. Finally, extraction of real data from the two platforms validates the design, and the experiments confirm that the prototype system achieves high recall and accuracy.
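The idea of regular-expression extraction driven by per-site rules can be illustrated with a minimal sketch. This is not the authors' implementation: the site keys, field names, patterns, and sample HTML below are all hypothetical, chosen only to show how one extractor might apply different rule sets to different sites.

```python
import re

# Hypothetical per-site rules: each field maps to a regex with one capture group.
# Neither the site keys nor the patterns come from the paper.
RULES = {
    "site_a": {
        "user_name": re.compile(r'<span class="nick">([^<]+)</span>'),
        "registered": re.compile(r"Registered:\s*(\d{4}-\d{2}-\d{2})"),
    },
    "site_b": {
        "user_name": re.compile(r'<a id="uname"[^>]*>([^<]+)</a>'),
        "registered": re.compile(r"Member since\s*(\d{4}-\d{2}-\d{2})"),
    },
}


def extract_user_info(site, html):
    """Apply the rule set for `site` to `html`; fields with no match map to None."""
    record = {}
    for field, pattern in RULES[site].items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up page fragment for illustration only.
sample = '<div><span class="nick"> alice88 </span> Registered: 2009-03-15</div>'
result = extract_user_info("site_a", sample)
```

Keeping the rules in a table separate from the matching loop means a new site can be supported by adding an entry to `RULES` without changing `extract_user_info`.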

Authors: Wang Hongwei, Wu Yangyang

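Affiliations: School of Mathematics and Computer Science, Quanzhou Normal University; College of Computer Science, Huaqiao University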

Source: Journal of Quanzhou Normal University, 2010, No. 4, pp. 12-17 (6 pages)

Keywords: Web data extraction; C2C e-commerce; regular expression

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
